COMP5625M Practical Assessment - Deep Learning [100 Marks]¶

This assessment is divided into two parts:
- Image classification using DNN and CNN [70 Marks]
- Use of RNN to predict texts for image captioning [30 Marks]
The maximum number of marks for each part is shown in the section headers. As indicated in the main heading above, the overall assessment carries a maximum of 100 marks.
This summative assessment is weighted 50% of the final grade for the module.
Motivation¶
Through this coursework, you will:
- Understand and implement your first deep neural network and convolutional neural network (CNN) and see how these can be used for classification problem
- Practice building, evaluating, and finetuning your CNN on an image dataset from development to testing stage.
- You will learn to tackle overfitting problem using strategies such as data augmentation and drop out.
- Compare your model performance and accuracy with others, such as the leaderboard on Kaggle
- Use RNNs to predict the caption of an image from established word vocabularies
- Understand and visualise text predictions for a given image.
Setup and resources¶
You must work using this provided template notebook.
Having a GPU will speed up the training process. See the provided document on Minerva about setting up a working environment for various ways to access a GPU.
Please implement the coursework using Python and PyTorch, and refer to the notebooks and exercises provided.
Submission¶
Please submit the following:
- Your completed Jupyter notebook file, without removing anything in the template, in .ipynb format.
- The .html version of your notebook; File > Download as > HTML (.html). Check that all cells have been run and all outputs (including all graphs you would like to be marked) displayed in the .html for marking.
- Your selected image from section 2.4.2 "Failure analysis"
Final note:
Please display everything that you would like to be marked. Under each section, put the relevant code containing your solution. You may re-use functions you defined previously, but any new code must be in the relevant section. Feel free to add as many code cells as you need under each section.
Your student username (for example, sc15jb):
mm23ap
Your full name:
Asish Panda
Part I: Image Classification [70 marks]¶
Dataset¶
This coursework will use a subset of images from Tiny ImageNet, which is a subset of the ImageNet dataset. Our subset of Tiny ImageNet contains 30 different categories, we will refer to it as TinyImageNet30. The training set has 450 resized images (64x64 pixels) for each category (13,500 images in total). You can download the training and test set from the Kaggle website:
Direct access of data is possible by clicking here, please use your university email to access this
To submit your results on the Kaggle competition. You can also access data here
To access the dataset, you will need an account on the Kaggle website. Even if you have an existing Kaggle account, please carefully adhere to these instructions, or we may not be able to locate your entries:
- Use your university email to register a new account.
- Set your Kaggle account NAME to your university username, for example,
sc15jb(see thenotebelow)
Note: If the name is already taken in the Kaggle then please use a similar pseudo name and add a note in your submission with the name you have used in the Kaggle.
Submitting your test result to Kaggle leaderboard¶
The class Kaggle competition also includes a blind test set, which will be used in Question 1 for evaluating your custom model's performance on a test set. The competition website will compute the test set accuracy, as well as position your model on the class leaderboard. More information is provided in the related section below.
Required packages¶
[1] numpy is package for scientific computing with python
[2] h5py is package to interact with compactly stored dataset
[3] matplotlib can be used for plotting graphs in python
[4] pytorch is library widely used for bulding deep-learning frameworks
Feel free to add to this section as needed some examples for importing some libraries is provided for you below.
import cv2
import math
import numpy as np
import torch
import torchvision
import torch.nn as nn
import torchvision.transforms as transforms
from torch.hub import load_state_dict_from_url
from PIL import Image
import matplotlib.pyplot as plt
# always check your version
print(torch.__version__)
2.2.1
One challenge of building a deep learning model is to choose an architecture that can learn the features in the dataset without being unnecessarily complex. The first part of the coursework involves building a CNN and training it on TinyImageNet30.
Overview of image classification:¶
1. Function implementation [14 marks]
- 1.1 PyTorch
DatasetandDataLoaderclasses (4 marks) - 1.2 PyTorch
Modelclass for a simple MLP model (4 marks) - 1.3 PyTorch
Modelclass for a simple CNN model (6 marks)
2. Model training [30 marks]
- 2.1 Training on TinyImageNet30 dataset (6 marks)
- 2.2 Generating confusion matrices and ROC curves (6 marks)
- 2.3 Strategies for tackling overfitting (18 marks)
- 2.3.1 Data augmentation
- 2.3.2 Dropout
- 2.3.3 Hyperparameter tuning (e.g. changing learning rate)
3. Model testing [10 marks]
- 3.1 Testing your final model in (2) on test set - code to do this (4 marks)
- 3.2 Uploading your result to Kaggle (6 marks)
4. Model Fine-tuning on CIFAR10 dataset [16 marks]
- 4.1 Fine-tuning your model (initialise your model with pretrained weights from (2)) (6 marks)
- 4.2 Fine-tuning model with frozen base convolution layers (6 marks)
- 4.3 Compare complete model retraining with pretrained weights and with frozen layers. Comment on what you observe. (4 marks)
1 Function implementations [14 marks]¶
# TO COMPLETE
import os
from torch.utils.data import Dataset, DataLoader, random_split
from PIL import Image
import torch.optim as optim
from torch.nn import functional as F
torch.manual_seed(0)
if torch.cuda.is_available():
device = torch.device("cuda")
else:
if torch.backends.mps.is_available():
device = torch.device("mps")
else:
device = torch.device("cpu")
print(device)
mps
class TinyImage30Dataset(Dataset):
def __init__(self, directory, transform=None):
self.directory = directory
self.transform = transform
self.classes = [f for f in os.listdir(directory) if not f.startswith('.') and not f.endswith('.txt')]
self.classes.sort()
self.class_to_idx = {self.classes[i]: i for i in range(len(self.classes))}
self.samples = []
for class_name in self.classes:
class_dir = os.path.join(directory, class_name)
for img_name in os.listdir(class_dir):
if img_name.endswith('.JPEG') or img_name.endswith('.png'):
self.samples.append((os.path.join(class_dir, img_name), self.class_to_idx[class_name]))
def __len__(self):
return len(self.samples)
def __getitem__(self, idx):
img_path, target = self.samples[idx]
image = Image.open(img_path).convert('RGB')
if self.transform is not None:
image = self.transform(image)
return image, target
transform = transforms.Compose([
transforms.Resize(64),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
train_dir = 'data/train_set'
test_dir = 'data/test_set'
full_dataset = TinyImage30Dataset(directory=train_dir, transform=transform)
full_dataset_size = len(full_dataset)
test_size = int(0.2 * full_dataset_size) # 20% for testing
train_size = full_dataset_size - test_size
# Split the dataset
train_dataset, val_dataset = random_split(full_dataset, [train_size, test_size])
# Creating DataLoader instances for the train and validation sets
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)
for i, (images,labels) in enumerate(val_loader):
print(f"Batch {i} in val_loader: images shape {images.shape}, labels shape {labels.shape}")
if i == 1: # Just to check the first couple of batches
break
Batch 0 in val_loader: images shape torch.Size([32, 3, 64, 64]), labels shape torch.Size([32]) Batch 1 in val_loader: images shape torch.Size([32, 3, 64, 64]), labels shape torch.Size([32])
for i, (images, labels) in enumerate(train_loader):
print(f"Batch {i} in train_loader: images shape {images.shape}, labels shape {labels.shape}")
if i == 1: # Just to check the first couple of batches
break
Batch 0 in train_loader: images shape torch.Size([32, 3, 64, 64]), labels shape torch.Size([32]) Batch 1 in train_loader: images shape torch.Size([32, 3, 64, 64]), labels shape torch.Size([32])
1.2 Define a MLP model class (4 marks)¶
Create a new model class using a combination of:
- Input Units
- Hidden Units
- Output Units
- Activation functions
- Loss function
- Optimiser
# TO COMPLETE
# define a MLP Model class
class MLP(nn.Module):
def __init__(self, input_units, hidden_units, output_units):
super(MLP, self).__init__()
self.fc1 = nn.Linear(input_units, hidden_units)
self.activation1 = nn.Sigmoid()
self.fc2 = nn.Linear(hidden_units, 256)
self.fc3 = nn.Linear(256, output_units)
def forward(self, x):
x = x.view(-1, 3 * 64 * 64)
x = self.activation1(self.fc1(x))
x = self.activation1(self.fc2(x))
x = self.fc3(x)
return x
input_units = 3 * 64 * 64
hidden_units = 512
output_units = 30
# Instantiate the model
mlp_model = MLP(input_units=input_units, hidden_units=hidden_units, output_units=output_units)
1.3 Define a CNN model class (6 marks)¶
Create a new model class using a combination of:
- Convolution layers
- Activation functions (e.g. ReLU)
- Maxpooling layers
- Fully connected layers
- Loss function
- Optimiser
Please note that the network should be at least a few layers for the model to perform well.
# TO COMPLETE
# define a CNN Model class
class SimpleCNN(nn.Module):
def __init__(self):
super(SimpleCNN, self).__init__()
self.conv1 = nn.Conv2d(3, 32, 3, padding=1) # Use 3 for RGB images
self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
self.relu = nn.ReLU()
self.maxpool = nn.MaxPool2d(2, 2)
self.fc1 = nn.Linear(64 * 16 * 16, 128)
self.fc2 = nn.Linear(128, 30)
def forward(self, x):
x = self.maxpool(self.relu(self.conv1(x)))
x = self.maxpool(self.relu(self.conv2(x)))
x = x.view(-1, 64 * 16 * 16) # Flatten the output
x = self.relu(self.fc1(x))
x = self.fc2(x)
return x
cnn_model = SimpleCNN()
2 Model training [30 marks]¶
2.1 Train both MLP and CNN model - show loss and accuracy graphs side by side (6 marks)¶
Train your model on the TinyImageNet30 dataset. Split the data into train and validation sets to determine when to stop training. Use seed at 0 for reproducibility and test_ratio=0.2 (validation data)
Display the graph of training and validation loss over epochs and accuracy over epochs to show how you determined the optimal number of training epochs. A top-k accuracy implementation is provided for you below.
Please leave the graph clearly displayed. Please use the same graph to plot graphs for both train and validation.
# (HelperDL function) -- Define top-*k* accuracy (**new**)
def topk_accuracy(output, target, topk=(1,)):
"""Computes the precision@k for the specified values of k"""
maxk = max(topk)
batch_size = target.size(0)
_, pred = output.topk(maxk, 1, True, True)
pred = pred.t()
correct = pred.eq(target.view(1, -1).expand_as(pred))
res = []
for k in topk:
correct_k = correct[:k].view(-1).float().sum(0)
res.append(correct_k.mul_(100.0 / batch_size))
return res
#TO COMPLETE --> Running your MLP model class
def train_model(model, train_loader, val_loader, criterion, optimizer, num_epochs=25):
# Set the device to GPU if available
model = model.to(device)
# Track loss and accuracy values for plotting
history = {'train_loss': [], 'val_loss': [], 'train_acc': [], 'val_acc': []}
for epoch in range(num_epochs):
model.train() # Set model to training mode
running_loss = 0.0
running_corrects = 0
# Training phase
for inputs, labels in train_loader:
inputs, labels = inputs.to(device), labels.to(device)
# Zero the parameter gradients
optimizer.zero_grad()
# Forward
outputs = model(inputs)
_, preds = torch.max(outputs, 1)
loss = criterion(outputs, labels)
# Backward + optimize
loss.backward()
optimizer.step()
# Statistics
running_loss += loss.item() * inputs.size(0)
running_corrects += torch.sum(preds == labels.data)
epoch_loss = running_loss / len(train_loader.dataset)
epoch_acc = running_corrects.float() / len(train_loader.dataset)
# Validation phase
model.eval() # Set model to evaluate mode
val_loss = 0.0
val_corrects = 0
with torch.no_grad():
for inputs, labels in val_loader:
inputs, labels = inputs.to(device), labels.to(device)
outputs = model(inputs)
_, preds = torch.max(outputs, 1)
loss = criterion(outputs, labels)
# Statistics
val_loss += loss.item() * inputs.size(0)
val_corrects += torch.sum(preds == labels.data)
val_epoch_loss = val_loss / len(val_loader.dataset)
val_epoch_acc = val_corrects.float() / len(val_loader.dataset)
print(f'Epoch {epoch}/{num_epochs - 1} - '
f'Train Loss: {epoch_loss:.4f} - Train Acc: {epoch_acc:.4f} - '
f'Val Loss: {val_epoch_loss:.4f} - Val Acc: {val_epoch_acc:.4f}')
# Record the loss and accuracy
history['train_loss'].append(epoch_loss)
history['val_loss'].append(val_epoch_loss)
history['train_acc'].append(epoch_acc.item())
history['val_acc'].append(val_epoch_acc.item())
return model, history
def plot_training_history(history):
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history['train_loss'], label='Train Loss')
plt.plot(history['val_loss'], label='Validation Loss')
plt.legend()
plt.title('Loss over Epochs')
plt.subplot(1, 2, 2)
plt.plot(history['train_acc'], label='Train Accuracy')
plt.plot(history['val_acc'], label='Validation Accuracy')
plt.legend()
plt.title('Accuracy over Epochs')
plt.show()
# Your graph
mlp_criterion = nn.CrossEntropyLoss()
mlp_optimizer = optim.Adam(mlp_model.parameters(), lr=1e-3)
# Train MLP model
mlp_model, mlp_history = train_model(mlp_model, train_loader, val_loader, mlp_criterion, mlp_optimizer, num_epochs=20)
# Plot training history for MLP
plot_training_history(mlp_history)
Epoch 0/19 - Train Loss: 2.6939 - Train Acc: 0.2243 - Val Loss: 2.8339 - Val Acc: 0.1915 Epoch 1/19 - Train Loss: 2.6307 - Train Acc: 0.2367 - Val Loss: 2.8290 - Val Acc: 0.2044 Epoch 2/19 - Train Loss: 2.5779 - Train Acc: 0.2494 - Val Loss: 2.8140 - Val Acc: 0.2033 Epoch 3/19 - Train Loss: 2.5259 - Train Acc: 0.2661 - Val Loss: 2.8211 - Val Acc: 0.2078 Epoch 4/19 - Train Loss: 2.4765 - Train Acc: 0.2756 - Val Loss: 2.7952 - Val Acc: 0.2070 Epoch 5/19 - Train Loss: 2.4260 - Train Acc: 0.2898 - Val Loss: 2.8103 - Val Acc: 0.2133 Epoch 6/19 - Train Loss: 2.3818 - Train Acc: 0.2978 - Val Loss: 2.8415 - Val Acc: 0.2115 Epoch 7/19 - Train Loss: 2.3444 - Train Acc: 0.3153 - Val Loss: 2.8225 - Val Acc: 0.2041 Epoch 8/19 - Train Loss: 2.3065 - Train Acc: 0.3248 - Val Loss: 2.8352 - Val Acc: 0.2144 Epoch 9/19 - Train Loss: 2.2602 - Train Acc: 0.3394 - Val Loss: 2.8225 - Val Acc: 0.2215 Epoch 10/19 - Train Loss: 2.2183 - Train Acc: 0.3480 - Val Loss: 2.8731 - Val Acc: 0.2100 Epoch 11/19 - Train Loss: 2.1710 - Train Acc: 0.3624 - Val Loss: 2.9228 - Val Acc: 0.2037 Epoch 12/19 - Train Loss: 2.1424 - Train Acc: 0.3662 - Val Loss: 2.9097 - Val Acc: 0.2074 Epoch 13/19 - Train Loss: 2.1006 - Train Acc: 0.3849 - Val Loss: 2.9088 - Val Acc: 0.2037 Epoch 14/19 - Train Loss: 2.0632 - Train Acc: 0.3917 - Val Loss: 2.8920 - Val Acc: 0.2141 Epoch 15/19 - Train Loss: 2.0226 - Train Acc: 0.4030 - Val Loss: 2.9400 - Val Acc: 0.2059 Epoch 16/19 - Train Loss: 1.9942 - Train Acc: 0.4092 - Val Loss: 2.9290 - Val Acc: 0.2119 Epoch 17/19 - Train Loss: 1.9525 - Train Acc: 0.4292 - Val Loss: 2.9578 - Val Acc: 0.2119 Epoch 18/19 - Train Loss: 1.8997 - Train Acc: 0.4399 - Val Loss: 3.0027 - Val Acc: 0.2126 Epoch 19/19 - Train Loss: 1.8712 - Train Acc: 0.4487 - Val Loss: 3.0355 - Val Acc: 0.2093
We can see from the above graphs that the model starts to overfit around epoch 15, where the validation loss starts deteriorating.
#TO COMPLETE --> Running your CNN model class
# Do the same for the CNN model with its own criterion and optimizer
cnn_model = SimpleCNN()
cnn_criterion = nn.CrossEntropyLoss()
cnn_optimizer = optim.Adam(cnn_model.parameters(), lr=1e-3)
# Train CNN model
cnn_model, cnn_history = train_model(cnn_model, train_loader, val_loader, cnn_criterion, cnn_optimizer, num_epochs=15)
# Plot training history for CNN
plot_training_history(cnn_history)
Epoch 0/14 - Train Loss: 2.8549 - Train Acc: 0.1851 - Val Loss: 2.5393 - Val Acc: 0.2670 Epoch 1/14 - Train Loss: 2.1982 - Train Acc: 0.3631 - Val Loss: 2.2049 - Val Acc: 0.3619 Epoch 2/14 - Train Loss: 1.7914 - Train Acc: 0.4811 - Val Loss: 2.2211 - Val Acc: 0.3770 Epoch 3/14 - Train Loss: 1.3977 - Train Acc: 0.5821 - Val Loss: 2.3648 - Val Acc: 0.3767 Epoch 4/14 - Train Loss: 1.0044 - Train Acc: 0.7005 - Val Loss: 2.5583 - Val Acc: 0.3878 Epoch 5/14 - Train Loss: 0.6160 - Train Acc: 0.8136 - Val Loss: 3.0013 - Val Acc: 0.3800 Epoch 6/14 - Train Loss: 0.3203 - Train Acc: 0.9092 - Val Loss: 3.6370 - Val Acc: 0.3481 Epoch 7/14 - Train Loss: 0.1845 - Train Acc: 0.9495 - Val Loss: 4.0592 - Val Acc: 0.3633 Epoch 8/14 - Train Loss: 0.1013 - Train Acc: 0.9731 - Val Loss: 4.4402 - Val Acc: 0.3559 Epoch 9/14 - Train Loss: 0.0796 - Train Acc: 0.9784 - Val Loss: 4.9322 - Val Acc: 0.3481 Epoch 10/14 - Train Loss: 0.0981 - Train Acc: 0.9731 - Val Loss: 5.3002 - Val Acc: 0.3404 Epoch 11/14 - Train Loss: 0.1037 - Train Acc: 0.9664 - Val Loss: 5.2639 - Val Acc: 0.3489 Epoch 12/14 - Train Loss: 0.0961 - Train Acc: 0.9710 - Val Loss: 5.8771 - Val Acc: 0.3515 Epoch 13/14 - Train Loss: 0.0632 - Train Acc: 0.9807 - Val Loss: 5.9884 - Val Acc: 0.3467 Epoch 14/14 - Train Loss: 0.0459 - Train Acc: 0.9857 - Val Loss: 6.2075 - Val Acc: 0.3463
We can see the accuracy flatlines at around 7 epochs and the validation accuracy starts decreasing after epoch 8, thus we will create another CNN model to see if our overall validation accuracy can increase from the current one.
class SimpleCNN_Improved(nn.Module):
def __init__(self, num_classes=30):
super(SimpleCNN_Improved, self).__init__()
self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1)
self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1)
self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
self.conv3 = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)
self.fc1 = nn.Linear(in_features=128 * 8 * 8, out_features=512)
self.fc2 = nn.Linear(in_features=512, out_features=num_classes)
def forward(self, x):
x = self.pool(F.relu(self.conv1(x)))
x = self.pool(F.relu(self.conv2(x)))
x = self.pool(F.relu(self.conv3(x)))
x = x.view(-1, 128 * 8 * 8) # Flatten the layer before feeding it into fully connected layers
x = F.leaky_relu(self.fc1(x))
x = self.fc2(x)
return x
model = SimpleCNN_Improved()
cnn_criterion = nn.CrossEntropyLoss()
cnn_optimizer = optim.Adam(model.parameters(), lr=1e-3)
# Train CNN model
model, cnn_history = train_model(model, train_loader, val_loader, cnn_criterion, cnn_optimizer, num_epochs=15)
# Plot training history for CNN
plot_training_history(cnn_history)
Epoch 0/14 - Train Loss: 2.7937 - Train Acc: 0.1975 - Val Loss: 2.3241 - Val Acc: 0.3070 Epoch 1/14 - Train Loss: 2.1819 - Train Acc: 0.3719 - Val Loss: 2.1370 - Val Acc: 0.3952 Epoch 2/14 - Train Loss: 1.7740 - Train Acc: 0.4769 - Val Loss: 2.0051 - Val Acc: 0.4389 Epoch 3/14 - Train Loss: 1.3587 - Train Acc: 0.5910 - Val Loss: 2.0714 - Val Acc: 0.4456 Epoch 4/14 - Train Loss: 0.8894 - Train Acc: 0.7256 - Val Loss: 2.4527 - Val Acc: 0.4300 Epoch 5/14 - Train Loss: 0.4503 - Train Acc: 0.8567 - Val Loss: 2.8047 - Val Acc: 0.4322 Epoch 6/14 - Train Loss: 0.2298 - Train Acc: 0.9315 - Val Loss: 3.4132 - Val Acc: 0.4241 Epoch 7/14 - Train Loss: 0.1582 - Train Acc: 0.9534 - Val Loss: 3.8183 - Val Acc: 0.4048 Epoch 8/14 - Train Loss: 0.1112 - Train Acc: 0.9647 - Val Loss: 4.4554 - Val Acc: 0.4126 Epoch 9/14 - Train Loss: 0.1018 - Train Acc: 0.9687 - Val Loss: 4.3927 - Val Acc: 0.4067 Epoch 10/14 - Train Loss: 0.1135 - Train Acc: 0.9643 - Val Loss: 4.7010 - Val Acc: 0.4004 Epoch 11/14 - Train Loss: 0.1121 - Train Acc: 0.9641 - Val Loss: 4.8048 - Val Acc: 0.4196 Epoch 12/14 - Train Loss: 0.0837 - Train Acc: 0.9763 - Val Loss: 5.0345 - Val Acc: 0.4037 Epoch 13/14 - Train Loss: 0.0773 - Train Acc: 0.9749 - Val Loss: 4.8666 - Val Acc: 0.4163 Epoch 14/14 - Train Loss: 0.0809 - Train Acc: 0.9754 - Val Loss: 5.0010 - Val Acc: 0.3922
We can observe that at around epoch 9 our training accuracy stops improving while our validation accuracy starts decreasing, thus we will only train our model till epoch 8 for the optimum performance
model = SimpleCNN_Improved()
cnn_criterion = nn.CrossEntropyLoss()
cnn_optimizer = optim.Adam(model.parameters(), lr=1e-3)
# Train CNN model
model, _ = train_model(model, train_loader, val_loader, cnn_criterion, cnn_optimizer, num_epochs=8)
Epoch 0/7 - Train Loss: 2.7822 - Train Acc: 0.2060 - Val Loss: 2.4333 - Val Acc: 0.2956 Epoch 1/7 - Train Loss: 2.2124 - Train Acc: 0.3576 - Val Loss: 2.2194 - Val Acc: 0.3670 Epoch 2/7 - Train Loss: 1.8412 - Train Acc: 0.4616 - Val Loss: 2.1156 - Val Acc: 0.4130 Epoch 3/7 - Train Loss: 1.4542 - Train Acc: 0.5688 - Val Loss: 2.0799 - Val Acc: 0.4293 Epoch 4/7 - Train Loss: 0.9894 - Train Acc: 0.6999 - Val Loss: 2.2935 - Val Acc: 0.4270 Epoch 5/7 - Train Loss: 0.5306 - Train Acc: 0.8375 - Val Loss: 2.7998 - Val Acc: 0.4326 Epoch 6/7 - Train Loss: 0.2612 - Train Acc: 0.9191 - Val Loss: 3.3461 - Val Acc: 0.4204 Epoch 7/7 - Train Loss: 0.1456 - Train Acc: 0.9555 - Val Loss: 3.8282 - Val Acc: 0.4111
Comment on your model and the results you have obtained. This should include the number of parameters for each of your models and briefly explain why one should use CNN over MLP for the image classification problem.
MLP Model¶
- The MLP (Multilayer Perceptron) model consists of three fully connected layers with Sigmoid activations for the first two layers and a CrossEntropyLoss criterion to handle multi-class classification.
- The model takes an input flattened from a 64x64 RGB image, resulting in a 12,288-dimensional input vector. This vector is then connected to a hidden layer with 512 units, followed by another hidden layer with 256 units, and finally to an output layer with 30 units corresponding to the number of classes.
- The performance of the MLP model on the training set improves over epochs. However, like the CNN model, the validation accuracy remains low and does not improve, while the validation loss increases, indicating overfitting.
CNN Model¶
- The CNN model consists of three convolutional layers with max pooling and ReLU activations, followed by two fully connected layers.
- Convolutional layers are designed to automatically and adaptively learn spatial hierarchies of features from input images, which is beneficial for image classification tasks.
- The CNN model exhibits similar overfitting trends as the MLP model, with high training accuracy but low validation accuracy that does not improve.
Number of Parameters:¶
- For the MLP:
- First fully connected layer: (3∗64∗64)∗512+512(3∗64∗64)∗512+512 (weights + biases)
- Second fully connected layer: 512∗256+256512∗256+256
- Third fully connected layer: 256∗30+30256∗30+30
- For the CNN:
- First convolutional layer: (3∗3∗3)∗32+32(3∗3∗3)∗32+32 (weights + biases)
- Second convolutional layer: (32∗3∗3)∗64+64(32∗3∗3)∗64+64
- Third convolutional layer: (64∗3∗3)∗128+128(64∗3∗3)∗128+128
- First fully connected layer: (128∗8∗8)∗512+512(128∗8∗8)∗512+512
- Second fully connected layer: 512∗30+30512∗30+30
Why Use CNN Over MLP for Image Classification:¶
- Parameter Efficiency: CNNs require significantly fewer parameters than MLPs because they take advantage of the 2D structure of input data, unlike MLPs, which require a separate parameter for each input feature.
- Feature Learning: CNNs can learn to recognize spatial hierarchies in images, capturing the presence of complex structures like edges and patterns within local receptive fields, which is essential for image recognition.
- Translation Invariance: The pooling layers in CNNs make the detection of features somewhat invariant to the location in the image.
- Generalization: CNNs are generally better at generalizing from training data to unseen data (when properly regularized), which is crucial for classification tasks with varied and complex datasets.
2.2 Generating confusion matrix and ROC curves (6 marks)¶
- Use your CNN architecture with best accuracy to generate two confusion matrices, one for the training set and another for the validation set. Remember to use the whole validation and training sets, and to include all your relevant code. Display the confusion matrices in a meaningful way which clearly indicates what percentage of the data is represented in each position.
- Display an ROC curve for the two top and two bottom classes with area under the curve
# Your code here!
from sklearn.metrics import confusion_matrix, roc_curve, auc
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import label_binarize
from itertools import cycle
# Function to get predictions and true labels
def get_all_preds_labels(loader, model):
all_preds = []
all_labels = []
model = model.to(device).eval()
with torch.no_grad():
for inputs, labels in loader:
inputs, labels = inputs.float().to(device), labels.to(device)
outputs = model(inputs)
_, preds = torch.max(outputs, 1)
all_preds.append(preds.cpu().numpy())
all_labels.append(labels.cpu().numpy())
# Concatenate all batches
all_preds = np.concatenate(all_preds)
all_labels = np.concatenate(all_labels)
return all_preds, all_labels
train_preds, train_labels = get_all_preds_labels(train_loader, model)
val_preds, val_labels = get_all_preds_labels(val_loader, model)
print(f"Train predictions shape: {train_preds.shape}, Train labels shape: {train_labels.shape}")
print(f"Validation predictions shape: {val_preds.shape}, Validation labels shape: {val_labels.shape}")
Train predictions shape: (10800,), Train labels shape: (10800,) Validation predictions shape: (2700,), Validation labels shape: (2700,)
train_conf_mat = confusion_matrix(train_labels, train_preds)
val_conf_mat = confusion_matrix(val_labels, val_preds)
# Plotting the confusion matrix
fig, ax = plt.subplots(1, 2, figsize=(12, 6))
sns.heatmap(train_conf_mat, annot=True, ax=ax[0], fmt='g')
ax[0].set_title('Training Set Confusion Matrix')
ax[0].set_xlabel('Predicted labels')
ax[0].set_ylabel('True labels')
sns.heatmap(val_conf_mat, annot=True, ax=ax[1], fmt='g')
ax[1].set_title('Validation Set Confusion Matrix')
ax[1].set_xlabel('Predicted labels')
ax[1].set_ylabel('True labels')
plt.show()
import torch
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize
import matplotlib.pyplot as plt
def get_predictions(loader, model, device='cpu'):
model = model.eval().to(device)
all_probs = []
prediction_labels = []
true_labels = []
with torch.no_grad():
for inputs, labels in loader:
inputs = inputs.to(device).float()
labels = labels.to(device)
outputs = model(inputs)
_, preds = torch.max(outputs, 1)
all_probs.append(outputs.cpu())
prediction_labels.extend(preds.cpu())
true_labels.extend(labels.cpu())
all_probs = torch.vstack(all_probs).numpy()
prediction_labels = np.array(prediction_labels)
true_labels = np.array(true_labels)
return all_probs, prediction_labels, true_labels
def compute_class_roc_auc(probs, labels_binarized, num_classes):
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(num_classes):
fpr[i], tpr[i], _ = roc_curve(labels_binarized[:, i], probs[:, i])
roc_auc[i] = auc(fpr[i], tpr[i])
return fpr, tpr, roc_auc
def plot_class_roc_curve(class_id, train_fpr, train_tpr, train_roc_auc, val_fpr, val_tpr, val_roc_auc):
plt.figure(figsize=(8, 6))
plt.plot(train_fpr[class_id], train_tpr[class_id], label=f'Train Class {class_id} (AUC = {train_roc_auc[class_id]:.2f})', color='blue', linestyle='-')
plt.plot(val_fpr[class_id], val_tpr[class_id], label=f'Val Class {class_id} (AUC = {val_roc_auc[class_id]:.2f})', color='red', linestyle='--')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title(f'ROC Curve for Class {class_id}')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()
train_probs, train_preds, train_labels = get_predictions(train_loader, model, device)
val_probs, val_preds, val_labels = get_predictions(val_loader, model, device)
# Binarize the labels for ROC computation
num_classes = model.fc2.out_features
train_labels_binarized = label_binarize(train_labels, classes=range(num_classes))
val_labels_binarized = label_binarize(val_labels, classes=range(num_classes))
# Compute ROC AUC for each class in both training and validation sets
train_fpr, train_tpr, train_roc_auc = compute_class_roc_auc(train_probs, train_labels_binarized, num_classes)
val_fpr, val_tpr, val_roc_auc = compute_class_roc_auc(val_probs, val_labels_binarized, num_classes)
# Find the top two and bottom two classes based on training set AUC
sorted_auc_indices = np.argsort([train_roc_auc[i] for i in range(num_classes)])
top_classes = sorted_auc_indices[-2:] # Top two classes
bottom_classes = sorted_auc_indices[:2] # Bottom two classes
# Plot ROC curves for the top and bottom classes
selected_classes = np.concatenate((top_classes, bottom_classes))
for class_id in selected_classes:
plot_class_roc_curve(class_id, train_fpr, train_tpr, train_roc_auc, val_fpr, val_tpr, val_roc_auc)
Redesign your CNN model (optional)¶
This is optional and does not carry any marks. Often to tackle model underfitting we tend to make more complex network design. Depending on your observation, you can improve your model if you wish.
class SimpleCNN_Improved(nn.Module):
def __init__(self, num_classes=30):
super(SimpleCNN_Improved, self).__init__()
self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1)
self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1)
self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
self.conv3 = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)
self.fc1 = nn.Linear(in_features=128 * 8 * 8, out_features=512)
self.fc2 = nn.Linear(in_features=512, out_features=num_classes)
def forward(self, x):
x = self.pool(F.relu(self.conv1(x)))
x = self.pool(F.relu(self.conv2(x)))
x = self.pool(F.relu(self.conv3(x)))
x = x.view(-1, 128 * 8 * 8) # Flatten the layer before feeding it into fully connected layers
x = F.relu(self.fc1(x))
x = self.fc2(x)
return x
model = SimpleCNN_Improved()
cnn_criterion = nn.CrossEntropyLoss()
cnn_optimizer = optim.Adam(model.parameters(), lr=1e-3)
# Train CNN model
model, cnn_history = train_model(model, train_loader, val_loader, cnn_criterion, cnn_optimizer, num_epochs=15)
# Plot training history for CNN
plot_training_history(cnn_history)
Note: All questions below here relates to the CNN model only and not an MLP model! You are advised to use your final CNN model only for each of the questions below.
2.3 Strategies for tackling overfitting (18 marks)¶
Using your (final) CNN model perform the strategies below to avoid overfitting problems. You can resuse the network weights from previous training, often referred to as fine tuning.
- 2.3.1 Data augmentation
- 2.3.2 Dropout
- 2.3.3 Hyperparameter tuning (e.g. changing learning rate)
Plot loss and accuracy graphs per epoch side by side for each implemented strategy.
2.3.1 Data augmentation (6 marks)¶
Implement at least five different data augmentation techniques that should include both photometric and geometric augmentations.
Provide graphs and comment on what you observe.
# Your code here!
data_transforms = transforms.Compose([
transforms.Resize(256),
transforms.ToTensor(),
transforms.RandomHorizontalFlip(p=0.5), # Geometric: Horizontally flip the image with a 50% probability
transforms.RandomVerticalFlip(p=0.5), # Geometric: Vertically flip the image with a 50% probability
transforms.RandomRotation(30), # Geometric: Rotate the image up to 30 degrees
transforms.RandomAffine(15), # Geometric: Apply a random affine transformation
transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.2), # Color: Adjust brightness, contrast, saturation, and hue
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])
new_full_dataset = TinyImage30Dataset(directory=train_dir, transform=data_transforms)
full_dataset_size = len(full_dataset)
test_size = int(0.2 * full_dataset_size) # 20% for testing
train_size = full_dataset_size - test_size
# Split the dataset
train_dataset, val_dataset = random_split(full_dataset, [train_size, test_size])
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)
cnn_model = SimpleCNN_Improved()
cnn_criterion = nn.CrossEntropyLoss()
cnn_optimizer = optim.Adam(cnn_model.parameters(), lr=1e-3)
# Train CNN model
cnn_model, cnn_history = train_model(cnn_model, train_loader, val_loader, cnn_criterion, cnn_optimizer, num_epochs=9)
# Plot training history for CNN
plot_training_history(cnn_history)
Epoch 0/8 - Train Loss: 2.8338 - Train Acc: 0.1864 - Val Loss: 2.5358 - Val Acc: 0.2667 Epoch 1/8 - Train Loss: 2.2262 - Train Acc: 0.3569 - Val Loss: 2.2390 - Val Acc: 0.3544 Epoch 2/8 - Train Loss: 1.8493 - Train Acc: 0.4519 - Val Loss: 2.1058 - Val Acc: 0.4019 Epoch 3/8 - Train Loss: 1.4602 - Train Acc: 0.5621 - Val Loss: 2.2445 - Val Acc: 0.4059 Epoch 4/8 - Train Loss: 0.9990 - Train Acc: 0.6918 - Val Loss: 2.3583 - Val Acc: 0.4141 Epoch 5/8 - Train Loss: 0.5796 - Train Acc: 0.8196 - Val Loss: 2.9492 - Val Acc: 0.4048 Epoch 6/8 - Train Loss: 0.2992 - Train Acc: 0.9071 - Val Loss: 3.8187 - Val Acc: 0.3711 Epoch 7/8 - Train Loss: 0.2114 - Train Acc: 0.9349 - Val Loss: 4.1484 - Val Acc: 0.3848 Epoch 8/8 - Train Loss: 0.1561 - Train Acc: 0.9532 - Val Loss: 4.5645 - Val Acc: 0.3804
Observations:¶
- The training loss decreases rapidly, which indicates that the model is learning effectively from the augmented data.
- The training accuracy increases steadily, reaching above 95%, which is a good sign that the model fits well with the training data.
- However, the validation loss does not decrease as hoped; instead, it increases slightly, suggesting the model needs to be generalised to unseen data.
- The validation accuracy is relatively low and remains flat, which may point to overfitting despite the augmentations.
- The discrepancy between training and validation metrics suggests the model needs to be generalised better. This could be due to several factors, such as model architecture, the complexity of the dataset, insufficient regularisation, or the fact that the augmentations need to be more diverse.
2.3.2 Dropout (6 marks)¶
Implement dropout in your model
Provide graphs and comment on your choice of proportion used.
class SimpleCNNWithDropout(nn.Module):
def __init__(self, num_classes=30):
super(SimpleCNNWithDropout, self).__init__()
self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1)
self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1)
self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
self.conv3 = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)
self.dropout1 = nn.Dropout(0.25) # Dropout layer after convolutional layers
self.fc1 = nn.Linear(in_features=128 * 8 * 8, out_features=512)
self.dropout2 = nn.Dropout(0.5) # Dropout layer before the last fully connected layer
self.fc2 = nn.Linear(in_features=512, out_features=num_classes)
def forward(self, x):
x = self.pool(F.relu(self.conv1(x)))
x = self.pool(F.relu(self.conv2(x)))
x = self.pool(F.relu(self.conv3(x)))
x = self.dropout1(x)
x = x.view(-1, 128 * 8 * 8) # Flatten the layer before feeding it into fully connected layers
x = F.leaky_relu(self.fc1(x))
x = self.dropout2(x)
x = self.fc2(x)
return x
# Model instantiation
model = SimpleCNNWithDropout(num_classes=30)
cnn_criterion = nn.CrossEntropyLoss()
cnn_optimizer = optim.Adam(model.parameters(), lr=1e-4)
# Train CNN model
model, cnn_history = train_model(model, train_loader, val_loader, cnn_criterion, cnn_optimizer, num_epochs=25)
# Plot training history for CNN
plot_training_history(cnn_history)
Epoch 0/24 - Train Loss: 3.1297 - Train Acc: 0.1181 - Val Loss: 2.7180 - Val Acc: 0.2285 Epoch 1/24 - Train Loss: 2.6810 - Train Acc: 0.2289 - Val Loss: 2.5420 - Val Acc: 0.2819 Epoch 2/24 - Train Loss: 2.4710 - Train Acc: 0.2881 - Val Loss: 2.4367 - Val Acc: 0.3263 Epoch 3/24 - Train Loss: 2.3241 - Train Acc: 0.3277 - Val Loss: 2.3587 - Val Acc: 0.3456 Epoch 4/24 - Train Loss: 2.2193 - Train Acc: 0.3498 - Val Loss: 2.3165 - Val Acc: 0.3578 Epoch 5/24 - Train Loss: 2.1295 - Train Acc: 0.3769 - Val Loss: 2.2572 - Val Acc: 0.3841 Epoch 6/24 - Train Loss: 2.0383 - Train Acc: 0.4063 - Val Loss: 2.2690 - Val Acc: 0.3915 Epoch 7/24 - Train Loss: 1.9542 - Train Acc: 0.4286 - Val Loss: 2.2637 - Val Acc: 0.4000 Epoch 8/24 - Train Loss: 1.8777 - Train Acc: 0.4524 - Val Loss: 2.1868 - Val Acc: 0.4122 Epoch 9/24 - Train Loss: 1.8062 - Train Acc: 0.4663 - Val Loss: 2.1669 - Val Acc: 0.4252 Epoch 10/24 - Train Loss: 1.7266 - Train Acc: 0.4931 - Val Loss: 2.1714 - Val Acc: 0.4374 Epoch 11/24 - Train Loss: 1.6555 - Train Acc: 0.5084 - Val Loss: 2.1339 - Val Acc: 0.4433 Epoch 12/24 - Train Loss: 1.5901 - Train Acc: 0.5304 - Val Loss: 2.1996 - Val Acc: 0.4381 Epoch 13/24 - Train Loss: 1.5215 - Train Acc: 0.5472 - Val Loss: 2.1772 - Val Acc: 0.4456 Epoch 14/24 - Train Loss: 1.4555 - Train Acc: 0.5693 - Val Loss: 2.1778 - Val Acc: 0.4448 Epoch 15/24 - Train Loss: 1.3904 - Train Acc: 0.5847 - Val Loss: 2.2055 - Val Acc: 0.4522 Epoch 16/24 - Train Loss: 1.3491 - Train Acc: 0.5973 - Val Loss: 2.1664 - Val Acc: 0.4663 Epoch 17/24 - Train Loss: 1.2765 - Train Acc: 0.6151 - Val Loss: 2.2494 - Val Acc: 0.4556 Epoch 18/24 - Train Loss: 1.2209 - Train Acc: 0.6302 - Val Loss: 2.3033 - Val Acc: 0.4641 Epoch 19/24 - Train Loss: 1.1713 - Train Acc: 0.6442 - Val Loss: 2.2752 - Val Acc: 0.4641 Epoch 20/24 - Train Loss: 1.1129 - Train Acc: 0.6587 - Val Loss: 2.2974 - Val Acc: 0.4678 Epoch 21/24 - Train Loss: 1.0334 - Train Acc: 0.6875 - Val Loss: 2.3196 - Val Acc: 0.4759 Epoch 22/24 - Train Loss: 0.9994 - Train Acc: 0.6956 - Val Loss: 2.3351 - Val Acc: 0.4630 Epoch 23/24 - Train Loss: 0.9415 - Train Acc: 0.7118 - Val Loss: 2.4161 - Val Acc: 0.4644 Epoch 24/24 - Train Loss: 0.8825 - Train Acc: 0.7258 - Val Loss: 2.4114 - Val Acc: 0.4630
Graph Analysis:¶
- Training Loss: There is a steady decline in training loss, which shows the model is learning and improving its ability to fit the training data over epochs.
- Validation Loss: The validation loss declines initially but starts to plateau and increase slightly towards the later epochs. This could suggest the beginning of overfitting or that the model has learned as much as it can from the training data given the current architecture and data.
- Training Accuracy: The training accuracy continuously increases, indicating consistent learning throughout the epochs.
- Validation Accuracy: The validation accuracy improves but plateaus around 46-47%. This suggests that while the model has learned patterns from the training data, its ability to generalize them to new data is limited.
Comments on Dropout Rates Used:¶
- Convolutional layers typically have fewer parameters than fully connected layers, making them less likely to overfit. However, some dropouts are still beneficial to promoting feature robustness. Therefore, a 25% dropout rate is chosen after the convolutional layers.
- The choice of 50% for dropout before the last fully connected layer is quite common in practice, especially for more extensive networks, to significantly reduce overfitting risks.
2.3.3 Hyperparameter tuning (6 marks)¶
Use learning rates [0.1, 0.001, 0.0001].
Provide graphs each for loss and accuracy at three different learning rates in a single graph.
cnn_model = SimpleCNNWithDropout()
cnn_optimizer = optim.Adam(cnn_model.parameters(), lr=0.1)
# Train CNN model
cnn_model, cnn_history_01 = train_model(cnn_model, train_loader, val_loader, cnn_criterion, cnn_optimizer, num_epochs=25)
Epoch 0/24 - Train Loss: 1848.7105 - Train Acc: 0.0337 - Val Loss: 3.4324 - Val Acc: 0.0341 Epoch 1/24 - Train Loss: 3.4565 - Train Acc: 0.0322 - Val Loss: 3.4218 - Val Acc: 0.0330 Epoch 2/24 - Train Loss: 3.4581 - Train Acc: 0.0362 - Val Loss: 3.4445 - Val Acc: 0.0348 Epoch 3/24 - Train Loss: 3.4713 - Train Acc: 0.0307 - Val Loss: 3.4350 - Val Acc: 0.0311 Epoch 4/24 - Train Loss: 3.4702 - Train Acc: 0.0373 - Val Loss: 3.4238 - Val Acc: 0.0289 Epoch 5/24 - Train Loss: 3.4738 - Train Acc: 0.0343 - Val Loss: 3.4184 - Val Acc: 0.0315 Epoch 6/24 - Train Loss: 3.4866 - Train Acc: 0.0311 - Val Loss: 3.4830 - Val Acc: 0.0348 Epoch 7/24 - Train Loss: 3.4941 - Train Acc: 0.0342 - Val Loss: 3.4506 - Val Acc: 0.0311 Epoch 8/24 - Train Loss: 3.4868 - Train Acc: 0.0356 - Val Loss: 3.4512 - Val Acc: 0.0304 Epoch 9/24 - Train Loss: 3.5054 - Train Acc: 0.0325 - Val Loss: 3.4545 - Val Acc: 0.0293 Epoch 10/24 - Train Loss: 3.4931 - Train Acc: 0.0336 - Val Loss: 3.4689 - Val Acc: 0.0337 Epoch 11/24 - Train Loss: 3.4914 - Train Acc: 0.0349 - Val Loss: 3.4547 - Val Acc: 0.0270 Epoch 12/24 - Train Loss: 3.5036 - Train Acc: 0.0328 - Val Loss: 3.4533 - Val Acc: 0.0348 Epoch 13/24 - Train Loss: 3.4914 - Train Acc: 0.0355 - Val Loss: 3.4605 - Val Acc: 0.0311 Epoch 14/24 - Train Loss: 3.5053 - Train Acc: 0.0336 - Val Loss: 3.4482 - Val Acc: 0.0348 Epoch 15/24 - Train Loss: 3.5112 - Train Acc: 0.0335 - Val Loss: 3.6300 - Val Acc: 0.0337 Epoch 16/24 - Train Loss: 3.5301 - Train Acc: 0.0305 - Val Loss: 3.4459 - Val Acc: 0.0404 Epoch 17/24 - Train Loss: 3.5146 - Train Acc: 0.0363 - Val Loss: 3.4729 - Val Acc: 0.0322 Epoch 18/24 - Train Loss: 3.5152 - Train Acc: 0.0336 - Val Loss: 3.4840 - Val Acc: 0.0348 Epoch 19/24 - Train Loss: 3.5244 - Train Acc: 0.0323 - Val Loss: 3.4794 - Val Acc: 0.0337 Epoch 20/24 - Train Loss: 3.5172 - Train Acc: 0.0297 - Val Loss: 3.4413 - Val Acc: 0.0337 Epoch 21/24 - Train Loss: 3.5085 - Train Acc: 0.0344 - Val Loss: 3.5022 - Val Acc: 0.0270 Epoch 22/24 - Train Loss: 3.5309 - Train Acc: 0.0328 - Val Loss: 3.5290 - Val Acc: 0.0289 Epoch 23/24 - Train Loss: 3.5178 - Train Acc: 0.0317 - Val Loss: 3.4716 - Val Acc: 0.0270 Epoch 24/24 - Train Loss: 3.5132 - Train Acc: 0.0329 - Val Loss: 3.4412 - Val Acc: 0.0341
cnn_model = SimpleCNNWithDropout()
# Your graph
cnn_optimizer = optim.Adam(cnn_model.parameters(), lr=0.01)
# Train CNN model
cnn_model, cnn_history_001 = train_model(cnn_model, train_loader, val_loader, cnn_criterion, cnn_optimizer, num_epochs=25)
Epoch 0/24 - Train Loss: 3.4997 - Train Acc: 0.0331 - Val Loss: 3.4085 - Val Acc: 0.0337 Epoch 1/24 - Train Loss: 3.4059 - Train Acc: 0.0325 - Val Loss: 3.4077 - Val Acc: 0.0315 Epoch 2/24 - Train Loss: 3.4078 - Train Acc: 0.0320 - Val Loss: 3.4029 - Val Acc: 0.0348 Epoch 3/24 - Train Loss: 3.4069 - Train Acc: 0.0303 - Val Loss: 3.4080 - Val Acc: 0.0307 Epoch 4/24 - Train Loss: 3.4066 - Train Acc: 0.0299 - Val Loss: 3.4065 - Val Acc: 0.0289 Epoch 5/24 - Train Loss: 3.4062 - Train Acc: 0.0319 - Val Loss: 3.4053 - Val Acc: 0.0289 Epoch 6/24 - Train Loss: 3.4069 - Train Acc: 0.0315 - Val Loss: 3.4055 - Val Acc: 0.0289 Epoch 7/24 - Train Loss: 3.4062 - Train Acc: 0.0294 - Val Loss: 3.4047 - Val Acc: 0.0307 Epoch 8/24 - Train Loss: 3.4075 - Train Acc: 0.0321 - Val Loss: 3.4053 - Val Acc: 0.0289 Epoch 9/24 - Train Loss: 3.4068 - Train Acc: 0.0334 - Val Loss: 3.4043 - Val Acc: 0.0307 Epoch 10/24 - Train Loss: 3.4076 - Train Acc: 0.0319 - Val Loss: 3.4067 - Val Acc: 0.0307 Epoch 11/24 - Train Loss: 3.4080 - Train Acc: 0.0340 - Val Loss: 3.4053 - Val Acc: 0.0348 Epoch 12/24 - Train Loss: 3.4066 - Train Acc: 0.0329 - Val Loss: 3.4059 - Val Acc: 0.0293 Epoch 13/24 - Train Loss: 3.4062 - Train Acc: 0.0306 - Val Loss: 3.4067 - Val Acc: 0.0315 Epoch 14/24 - Train Loss: 3.4068 - Train Acc: 0.0325 - Val Loss: 3.4066 - Val Acc: 0.0270 Epoch 15/24 - Train Loss: 3.4073 - Train Acc: 0.0334 - Val Loss: 3.4068 - Val Acc: 0.0270 Epoch 16/24 - Train Loss: 3.4074 - Train Acc: 0.0327 - Val Loss: 3.4045 - Val Acc: 0.0311 Epoch 17/24 - Train Loss: 3.4062 - Train Acc: 0.0319 - Val Loss: 3.4059 - Val Acc: 0.0289 Epoch 18/24 - Train Loss: 3.4081 - Train Acc: 0.0311 - Val Loss: 3.4042 - Val Acc: 0.0307 Epoch 19/24 - Train Loss: 3.4068 - Train Acc: 0.0313 - Val Loss: 3.4123 - Val Acc: 0.0270 Epoch 20/24 - Train Loss: 3.4081 - Train Acc: 0.0337 - Val Loss: 3.4037 - Val Acc: 0.0315 Epoch 21/24 - Train Loss: 3.4070 - Train Acc: 0.0322 - Val Loss: 3.4060 - Val Acc: 0.0289 Epoch 22/24 - Train Loss: 3.4065 - Train Acc: 0.0350 - Val Loss: 3.4131 - Val Acc: 0.0293 Epoch 23/24 - Train Loss: 3.4069 - Train Acc: 0.0316 - Val Loss: 3.4034 - Val Acc: 0.0293 Epoch 24/24 - Train Loss: 3.4084 - Train Acc: 0.0328 - Val Loss: 3.4093 - Val Acc: 0.0289
cnn_model = SimpleCNNWithDropout()
cnn_optimizer = optim.Adam(cnn_model.parameters(), lr=0.0001)
# Train CNN model
cnn_model, cnn_history_0001 = train_model(cnn_model, train_loader, val_loader, cnn_criterion, cnn_optimizer, num_epochs=25)
Epoch 0/24 - Train Loss: 3.1343 - Train Acc: 0.1222 - Val Loss: 2.7691 - Val Acc: 0.2059 Epoch 1/24 - Train Loss: 2.7091 - Train Acc: 0.2233 - Val Loss: 2.5356 - Val Acc: 0.2804 Epoch 2/24 - Train Loss: 2.4989 - Train Acc: 0.2765 - Val Loss: 2.4855 - Val Acc: 0.3104 Epoch 3/24 - Train Loss: 2.3616 - Train Acc: 0.3140 - Val Loss: 2.4030 - Val Acc: 0.3270 Epoch 4/24 - Train Loss: 2.2428 - Train Acc: 0.3488 - Val Loss: 2.2973 - Val Acc: 0.3556 Epoch 5/24 - Train Loss: 2.1350 - Train Acc: 0.3831 - Val Loss: 2.2465 - Val Acc: 0.3785 Epoch 6/24 - Train Loss: 2.0420 - Train Acc: 0.4020 - Val Loss: 2.2486 - Val Acc: 0.3852 Epoch 7/24 - Train Loss: 1.9549 - Train Acc: 0.4302 - Val Loss: 2.2105 - Val Acc: 0.3933 Epoch 8/24 - Train Loss: 1.8816 - Train Acc: 0.4512 - Val Loss: 2.1862 - Val Acc: 0.4152 Epoch 9/24 - Train Loss: 1.7959 - Train Acc: 0.4756 - Val Loss: 2.1605 - Val Acc: 0.4193 Epoch 10/24 - Train Loss: 1.7205 - Train Acc: 0.4959 - Val Loss: 2.1412 - Val Acc: 0.4407 Epoch 11/24 - Train Loss: 1.6645 - Train Acc: 0.5053 - Val Loss: 2.1560 - Val Acc: 0.4341 Epoch 12/24 - Train Loss: 1.6008 - Train Acc: 0.5219 - Val Loss: 2.1342 - Val Acc: 0.4519 Epoch 13/24 - Train Loss: 1.5419 - Train Acc: 0.5414 - Val Loss: 2.1207 - Val Acc: 0.4496 Epoch 14/24 - Train Loss: 1.4654 - Train Acc: 0.5669 - Val Loss: 2.1517 - Val Acc: 0.4552 Epoch 15/24 - Train Loss: 1.3979 - Train Acc: 0.5812 - Val Loss: 2.2454 - Val Acc: 0.4496 Epoch 16/24 - Train Loss: 1.3533 - Train Acc: 0.5931 - Val Loss: 2.1678 - Val Acc: 0.4530 Epoch 17/24 - Train Loss: 1.2842 - Train Acc: 0.6134 - Val Loss: 2.1698 - Val Acc: 0.4707 Epoch 18/24 - Train Loss: 1.2256 - Train Acc: 0.6323 - Val Loss: 2.1835 - Val Acc: 0.4622 Epoch 19/24 - Train Loss: 1.1527 - Train Acc: 0.6504 - Val Loss: 2.2631 - Val Acc: 0.4659 Epoch 20/24 - Train Loss: 1.1021 - Train Acc: 0.6673 - Val Loss: 2.2578 - Val Acc: 0.4637 Epoch 21/24 - Train Loss: 1.0433 - Train Acc: 0.6778 - Val Loss: 2.2775 - Val Acc: 0.4630 Epoch 22/24 - Train Loss: 0.9917 - Train Acc: 0.6971 - Val Loss: 2.3262 - Val Acc: 0.4667 Epoch 23/24 - Train Loss: 0.9411 - Train Acc: 0.7102 - Val Loss: 2.4103 - Val Acc: 0.4607 Epoch 24/24 - Train Loss: 0.8877 - Train Acc: 0.7276 - Val Loss: 2.4124 - Val Acc: 0.4674
plt.figure(figsize=(14, 5))
# Plot training accuracy
plt.subplot(1, 2, 1)
plt.plot(cnn_history_01['train_acc'], label='LR=0.1')
plt.plot(cnn_history_001['train_acc'], label='LR=0.01')
plt.plot(cnn_history_0001['train_acc'], label='LR=0.001')
plt.title('Training Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
# Plot validation accuracy
plt.subplot(1, 2, 2)
plt.plot(cnn_history_01['val_acc'], label='LR=0.1')
plt.plot(cnn_history_001['val_acc'], label='LR=0.01')
plt.plot(cnn_history_0001['val_acc'], label='LR=0.001')
plt.title('Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
3 Model testing [10 marks]¶
Online evaluation of your model performance on the test set.
Prepare the dataloader for the testset.
Write evaluation code for writing predictions.
Upload it to Kaggle submission page (6 marks)
class TinyImage30TestDataset(Dataset):
def __init__(self, directory, transform=None):
self.directory = directory
self.transform = transform
# Load only JPEG and PNG images and sort them to maintain order
self.images = sorted(
[img for img in os.listdir(directory) if img.endswith('.JPEG') or img.endswith('.png')]
)
def __len__(self):
return len(self.images)
def __getitem__(self, idx):
# Get the image path and load it
img_name = self.images[idx]
img_path = os.path.join(self.directory, img_name)
image = Image.open(img_path).convert('RGB')
# Apply the transformation
if self.transform is not None:
image = self.transform(image)
# Return the image tensor and its corresponding filename
return image, img_name
test_transform = transforms.Compose([
transforms.Resize(256),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
test_dataset = TinyImage30TestDataset(directory=test_dir, transform=transform)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
3.2 Prepare your submission and upload to Kaggle (6 marks)¶
Save all test predictions to a CSV file and submit it to the private class Kaggle competition. Please save your test CSV file submissions using your student username (the one with letters, e.g., sc15jb, not the ID with only numbers), for example, sc15jb.csv. That will help us to identify your submissions.
The CSV file must contain only two columns: ‘Id’ and ‘Category’ (predicted class ID) as shown below:
txt
Id,Category
28d0f5e9_373c.JPEG,2
bbe4895f_40bf.JPEG,18
The ‘Id’ column should include the name of the image. It is important to keep the same name as the one on the test set. Do not include any path, just the name of file (with extension). Your csv file must contain 1501 rows, one for each image on test set and 1 row for the headers. To submit please click here.
You may submit multiple times. We will use your personal top entry for allocating marks for this [6 marks].
# Your code here!
import pandas as pd
import torch
# Set the model to evaluation mode
model.eval()
test_ids = []
pred_categories = []
with torch.no_grad():
for data in test_loader:
images, ids = data
images = images.to(device)
outputs = model(images)
_, predicted = torch.max(outputs, 1)
test_ids.extend(ids)
pred_categories.extend(predicted.cpu().numpy().tolist())
# Create a DataFrame with the results
submission_df = pd.DataFrame({
'Id': test_ids,
'Category': pred_categories
})
submission_file_name = 'mm23ap.csv'
submission_df.to_csv(submission_file_name, index=False)
4 Model Fine-tuning/transfer learning on CIFAR10 dataset [16 marks]¶
Fine-tuning is a way of applying or utilizing transfer learning. It is a process that takes a model that has already been trained for one given task and then tunes or tweaks the model to make it perform a second similar task. You can perform fine-tuning in following fashion:
Train an entire model: Start training model from scratch (large dataset, more computation)
Train some layers, freeze others: Lower layer features are general (problem independent) while higher layer features are specific (problem dependent – freeze)
Freeze convolution base and train only last FC layers (small dataset and lower computation)
Configuring your dataset
- Download your dataset using
torchvision.datasets.CIFAR10explained here - Split training dataset into training and validation set similar to above. Note that the number of categories here is only 10
# Your code here!
cifar10_train = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
train_size = int(0.8 * len(cifar10_train))
val_size = len(cifar10_train) - train_size
train_dataset, val_dataset = random_split(cifar10_train, [train_size, val_size])
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)
Files already downloaded and verified
Load pretrained AlexNet from PyTorch - use model copies to apply transfer learning in different configurations
# Your code here!
alexnet = torchvision.models.alexnet(pretrained=True)
alexnet.classifier[6] = nn.Linear(alexnet.classifier[6].in_features, 10)
alexnet = alexnet.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(alexnet.classifier.parameters(), lr=0.001)
4.1 Apply transfer learning with pretrained model weights (6 marks)¶
Configuration 1: No frozen layers
# Your model changes here - also print trainable parameters
print("Trainable parameters in the model: \n")
for name, param in alexnet.named_parameters():
if param.requires_grad:
print(name, param.data.shape)
print("======================== \n")
alexnet, alexnet_history = train_model(alexnet, train_loader, val_loader, criterion, optimizer, num_epochs=11)
plot_training_history(alexnet_history)
Trainable parameters in the model: features.0.weight torch.Size([64, 3, 11, 11]) features.0.bias torch.Size([64]) features.3.weight torch.Size([192, 64, 5, 5]) features.3.bias torch.Size([192]) features.6.weight torch.Size([384, 192, 3, 3]) features.6.bias torch.Size([384]) features.8.weight torch.Size([256, 384, 3, 3]) features.8.bias torch.Size([256]) features.10.weight torch.Size([256, 256, 3, 3]) features.10.bias torch.Size([256]) classifier.1.weight torch.Size([4096, 9216]) classifier.1.bias torch.Size([4096]) classifier.4.weight torch.Size([4096, 4096]) classifier.4.bias torch.Size([4096]) classifier.6.weight torch.Size([10, 4096]) classifier.6.bias torch.Size([10]) ======================== Epoch 0/10 - Train Loss: 1.4292 - Train Acc: 0.5176 - Val Loss: 1.1246 - Val Acc: 0.6076 Epoch 1/10 - Train Loss: 1.2463 - Train Acc: 0.5739 - Val Loss: 1.0793 - Val Acc: 0.6240 Epoch 2/10 - Train Loss: 1.1725 - Train Acc: 0.5981 - Val Loss: 1.0868 - Val Acc: 0.6136 Epoch 3/10 - Train Loss: 1.1495 - Train Acc: 0.6095 - Val Loss: 1.0164 - Val Acc: 0.6505 Epoch 4/10 - Train Loss: 1.1187 - Train Acc: 0.6218 - Val Loss: 1.0467 - Val Acc: 0.6452 Epoch 5/10 - Train Loss: 1.0963 - Train Acc: 0.6284 - Val Loss: 1.0064 - Val Acc: 0.6572 Epoch 6/10 - Train Loss: 1.0777 - Train Acc: 0.6348 - Val Loss: 1.0002 - Val Acc: 0.6613 Epoch 7/10 - Train Loss: 1.0630 - Train Acc: 0.6414 - Val Loss: 1.0648 - Val Acc: 0.6354 Epoch 8/10 - Train Loss: 1.0500 - Train Acc: 0.6459 - Val Loss: 1.0172 - Val Acc: 0.6588 Epoch 9/10 - Train Loss: 1.0452 - Train Acc: 0.6517 - Val Loss: 0.9864 - Val Acc: 0.6685 Epoch 10/10 - Train Loss: 1.0227 - Train Acc: 0.6598 - Val Loss: 0.9897 - Val Acc: 0.6693
4.2 Fine-tuning model with frozen layers (6 marks)¶
Configuration 2: Frozen base convolution blocks
# Your changes here - also print trainable parameters
alexnet_frozen = torchvision.models.alexnet(pretrained=True)
for param in alexnet_frozen.features.parameters():
param.requires_grad = False
print("Trainable parameters in the model: \n")
for name, param in alexnet_frozen.named_parameters():
if param.requires_grad:
print(name, param.data.shape)
print("======================== \n")
alexnet_frozen.classifier[6] = nn.Linear(alexnet.classifier[6].in_features, 10)
alexnet_frozen = alexnet_frozen.to(device)
optimizer = optim.Adam(alexnet_frozen.classifier.parameters(), lr=0.001)
classifier.1.weight torch.Size([4096, 9216]) classifier.1.bias torch.Size([4096]) classifier.4.weight torch.Size([4096, 4096]) classifier.4.bias torch.Size([4096]) classifier.6.weight torch.Size([1000, 4096]) classifier.6.bias torch.Size([1000]) ========================
alexnet_frozen, alexnet_frozen_history = train_model(alexnet_frozen, train_loader, val_loader, criterion, optimizer, num_epochs=11)
plot_training_history(alexnet_frozen_history)
Epoch 0/10 - Train Loss: 1.4120 - Train Acc: 0.5217 - Val Loss: 1.1114 - Val Acc: 0.6065 Epoch 1/10 - Train Loss: 1.2398 - Train Acc: 0.5773 - Val Loss: 1.1116 - Val Acc: 0.6191 Epoch 2/10 - Train Loss: 1.1833 - Train Acc: 0.5986 - Val Loss: 1.0244 - Val Acc: 0.6420 Epoch 3/10 - Train Loss: 1.1456 - Train Acc: 0.6108 - Val Loss: 1.0388 - Val Acc: 0.6447 Epoch 4/10 - Train Loss: 1.1160 - Train Acc: 0.6245 - Val Loss: 1.0322 - Val Acc: 0.6518 Epoch 5/10 - Train Loss: 1.1008 - Train Acc: 0.6324 - Val Loss: 1.0259 - Val Acc: 0.6561 Epoch 6/10 - Train Loss: 1.0798 - Train Acc: 0.6358 - Val Loss: 1.0239 - Val Acc: 0.6467 Epoch 7/10 - Train Loss: 1.0666 - Train Acc: 0.6426 - Val Loss: 1.0156 - Val Acc: 0.6619 Epoch 8/10 - Train Loss: 1.0532 - Train Acc: 0.6493 - Val Loss: 1.0106 - Val Acc: 0.6589 Epoch 9/10 - Train Loss: 1.0396 - Train Acc: 0.6548 - Val Loss: 0.9837 - Val Acc: 0.6722 Epoch 10/10 - Train Loss: 1.0256 - Train Acc: 0.6598 - Val Loss: 0.9881 - Val Acc: 0.6736
4.3 Compare above configurations and comment on performances. (4 marks)¶
# Your graphs here and please provide comment in markdown in another cell
train_losses_no_freeze = alexnet_history['train_loss']
val_losses_no_freeze = alexnet_history['val_loss']
train_losses_frozen = alexnet_frozen_history['train_loss']
val_losses_frozen = alexnet_frozen_history['val_loss']
train_accuracies_no_freeze = alexnet_history['train_acc']
val_accuracies_no_freeze = alexnet_history['val_acc']
train_accuracies_frozen = alexnet_frozen_history['train_acc']
val_accuracies_frozen = alexnet_frozen_history['val_acc']
epochs = range(1, len(train_losses_no_freeze) + 1)
plt.figure(figsize=(14, 7))
plt.subplot(1, 2, 1)
plt.plot(epochs, train_losses_no_freeze, label='Training Loss - No Freeze')
plt.plot(epochs, val_losses_no_freeze, label='Validation Loss - No Freeze')
plt.plot(epochs, train_losses_frozen, label='Training Loss - Frozen Base')
plt.plot(epochs, val_losses_frozen, label='Validation Loss - Frozen Base')
plt.title('Training and Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(epochs, train_accuracies_no_freeze, label='Training Accuracy - No Freeze')
plt.plot(epochs, val_accuracies_no_freeze, label='Validation Accuracy - No Freeze')
plt.plot(epochs, train_accuracies_frozen, label='Training Accuracy - Frozen Base')
plt.plot(epochs, val_accuracies_frozen, label='Validation Accuracy - Frozen Base')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.tight_layout()
plt.show()
From the loss graph, we can observe that:¶
- The model with the frozen base starts with a higher training loss, but the loss decreases more smoothly than without frozen layers.
- The validation loss for the frozen base model is consistently lower than that for the model without a freeze, which suggests that freezing the base is helping with generalization.
From the accuracy graph, it appears that:¶
- The model with the frozen base has a smoother increase in training accuracy and, after initial epochs, outperforms the model with no frozen layers regarding validation accuracy.
- The model with no freeze has a higher variance in validation accuracy, potentially indicating overfitting to the training data or instability in learning.
Conclusions:¶
- The model with the frozen base layers seems to generalize better on the validation set, indicated by a lower validation loss and higher validation accuracy.
- The smoother curves for the frozen base model suggest that it is learning more steadily and may be less prone to overfitting than the model with all layers trainable.
- The initial higher training loss for the frozen base model could be because the pre-trained features are not a perfect fit for the new dataset. However, as the fully connected layers adapt, the model quickly improves.
Part II: Image Captioning using RNN [30 marks]¶
Motivation¶
Through this part of assessment you will:
- Understand the principles of text pre-processing and vocabulary building (provided).
- Gain experience working with an image to text model.
- Use and compare text similarity metrics for evaluating an image to text model, and understand evaluation challenges.
Dataset¶
This assessment will use a subset of the COCO "Common Objects in Context" dataset for image caption generation. COCO contains 330K images, of 80 object categories, and at least five textual reference captions per image. Our subset consists of nearly 5070 of these images, each of which has five or more different descriptions of the salient entities and activities, and we will refer to it as COCO_5070.
To download the data:
- Images and annotations: download the zipped file provided in the link here as
COMP5625M_data_assessment_2.zip.
Info only: To understand more about the COCO dataset you can look at the download page. We have already provided you with the "2017 Train/Val annotations (241MB)" but our image subset consists of fewer images compared to orginial COCO dataset. So, no need to download anything from here!
- Image meta data: as our set is a subset of full COCO dataset, we have created a CSV file containing relevant meta data for our particular subset of images. You can download it also from Drive, "coco_subset_meta.csv" at the same link as 1.
Submission¶
You can either submit the same file or make a two separate .ipython notebook files zipped in the submission (please name as yourstudentusername_partI.ipynb and yourstudentusername_partII.ipynb).
Final note:
Please include in this notebook everything that you would like to be marked, including figures. Under each section, put the relevant code containing your solution. You may re-use functions you defined previously, but any new code must be in the relevant section. Feel free to add as many code cells as you need under each section.
The basic principle of our image-to-text model is as pictured in the diagram below, where an Encoder network encodes the input image as a feature vector by providing the outputs of the last fully-connected layer of a pre-trained CNN (we use ResNet50). This pretrained network has been trained on the complete ImageNet dataset and is thus able to recognise common objects.
You can alternatively use the COCO trained pretrained weights from PyTorch. One way to do this is use the "FasterRCNN_ResNet50_FPN_V2_Weights.COCO_V1" but use e.g., "resnet_model = model.backbone.body". Alternatively, you can use the checkpoint from your previous coursework where you fine-tuned to COCO dataset.
These features are then fed into a Decoder network along with the reference captions. As the image feature dimensions are large and sparse, the Decoder network includes a linear layer which downsizes them, followed by a batch normalisation layer to speed up training. Those resulting features, as well as the reference text captions, are then passed into a recurrent network (we will use an RNNs in this assessment).
The reference captions used to compute loss are represented as numerical vectors via an embedding layer whose weights are learned during training.

Instructions for creating vocabulary¶
A helper function file helperDL.py is provided that includes all the functions that will do the following for you. You can easily import these functions in the exercise, most are already done for you!
- Extracting image features (a trained checkpoint is provided
resnet50_caption.ptfor you to download and use it for training your RNN)- Text preparation of training and validation data is provided
import torch
import torch.nn as nn
from torchvision import transforms
import torchvision.models as models
from torch.utils.data import Dataset
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
import matplotlib.pyplot as plt
from collections import Counter
import os
import numpy as np
Please refer to the submission section at the top of this notebook to prepare your submission.
device = torch.device('cuda' if torch.cuda.is_available() else 'mps')
print('Using device:', device)
Using device: mps
# Load the feature map provided to you
features_map = torch.load('data/resnet50_caption.pt', map_location=device)
5 Train DecoderRNN [20 marks]¶
5.1 Design a RNN-based decoder (10 marks)
5.2 Train your model with precomputed features (10 Marks)
5.1 Design a RNN-based decoder (10 marks)¶
Read through the DecoderRNN model below. First, complete the decoder by adding an rnn layer to the decoder where indicated, using the PyTorch API as reference.
Keep all the default parameters except for batch_first, which you may set to True.
In particular, understand the meaning of pack_padded_sequence() as used in forward(). Refer to the PyTorch pack_padded_sequence() documentation.
import json
import pandas as pd
with open('data/COMP5625M_data_assessment_2/coco/annotations2017/captions_train2017.json', 'r') as json_file:
data = json.load(json_file)
df = pd.DataFrame.from_dict(data["annotations"])
df.head()
| image_id | id | caption | |
|---|---|---|---|
| 0 | 203564 | 37 | A bicycle replica with a clock as the front wh... |
| 1 | 322141 | 49 | A room with blue walls and a white sink and door. |
| 2 | 16977 | 89 | A car that seems to be parked illegally behind... |
| 3 | 106140 | 98 | A large passenger airplane flying through the ... |
| 4 | 106140 | 101 | There is a GOL plane taking off in a partly cl... |
coco_subset = pd.read_csv("data/COMP5625M_data_assessment_2/coco_subset_meta.csv")
new_data = pd.DataFrame( data['annotations'])
new_coco = coco_subset.rename(columns={'id':'image_id'})
new_coco.drop_duplicates('file_name',keep='first',inplace=True)
new_subset = new_data.sort_values(['image_id'], ascending=True)
# Get all the reference captions
new_file = pd.merge(new_coco,new_subset,how = 'inner', on = 'image_id')
new_file = new_file[['image_id','id','caption','file_name']]
new_file = new_file.sort_values(['image_id'], ascending = True)
new_file.head(10)
| image_id | id | caption | file_name | |
|---|---|---|---|---|
| 16755 | 9 | 661611 | Closeup of bins of food that include broccoli ... | 000000000009.jpg |
| 16759 | 9 | 661977 | A meal is presented in brightly colored plasti... | 000000000009.jpg |
| 16758 | 9 | 663627 | there are containers filled with different kin... | 000000000009.jpg |
| 16757 | 9 | 666765 | Colorful dishes holding meat, vegetables, frui... | 000000000009.jpg |
| 16756 | 9 | 667602 | A bunch of trays that have different food. | 000000000009.jpg |
| 9634 | 25 | 127238 | A giraffe mother with its baby in the forest. | 000000000025.jpg |
| 9633 | 25 | 127076 | A giraffe standing up nearby a tree | 000000000025.jpg |
| 9635 | 25 | 133058 | Two giraffes standing in a tree filled area. | 000000000025.jpg |
| 9636 | 25 | 133676 | A giraffe standing next to a forest filled wit... | 000000000025.jpg |
| 9637 | 25 | 122312 | A giraffe eating food from the top of the tree. | 000000000025.jpg |
# getting the clearn clean - e.g., converting all uppercases to lowercases
new_file["clean_caption"] = ""
from helperDL import gen_clean_captions_df
new_file = gen_clean_captions_df(new_file)
new_file.head(10)
| image_id | id | caption | file_name | clean_caption | |
|---|---|---|---|---|---|
| 16755 | 9 | 661611 | Closeup of bins of food that include broccoli ... | 000000000009.jpg | closeup of bins of food that include broccoli ... |
| 16759 | 9 | 661977 | A meal is presented in brightly colored plasti... | 000000000009.jpg | a meal is presented in brightly colored plasti... |
| 16758 | 9 | 663627 | there are containers filled with different kin... | 000000000009.jpg | there are containers filled with different kin... |
| 16757 | 9 | 666765 | Colorful dishes holding meat, vegetables, frui... | 000000000009.jpg | colorful dishes holding meat vegetables fruit ... |
| 16756 | 9 | 667602 | A bunch of trays that have different food. | 000000000009.jpg | a bunch of trays that have different food |
| 9634 | 25 | 127238 | A giraffe mother with its baby in the forest. | 000000000025.jpg | a giraffe mother with its baby in the forest |
| 9633 | 25 | 127076 | A giraffe standing up nearby a tree | 000000000025.jpg | a giraffe standing up nearby a tree |
| 9635 | 25 | 133058 | Two giraffes standing in a tree filled area. | 000000000025.jpg | two giraffes standing in a tree filled area |
| 9636 | 25 | 133676 | A giraffe standing next to a forest filled wit... | 000000000025.jpg | a giraffe standing next to a forest filled wit... |
| 9637 | 25 | 122312 | A giraffe eating food from the top of the tree. | 000000000025.jpg | a giraffe eating food from the top of the tree |
##### Spilt your training, validation and test dataset with indexes to each set
from helperDL import split_ids
train_id, valid_id, test_id = split_ids(new_file['image_id'].unique())
print("training:{}, validation:{}, test:{}".format(len(train_id), len(valid_id), len(test_id)))
training:3547, validation:506, test:1015
train_set = new_file.loc[new_file['image_id'].isin(train_id)]
valid_set = new_file.loc[new_file['image_id'].isin(valid_id)]
test_set = new_file.loc[new_file['image_id'].isin(test_id)]
class Vocabulary(object):
""" Simple vocabulary wrapper which maps every unique word to an integer ID. """
def __init__(self):
# intially, set both the IDs and words to dictionaries with special tokens
self.word2idx = {'<pad>': 0, '<unk>': 1, '<end>': 2}
self.idx2word = {0: '<pad>', 1: '<unk>', 2: '<end>'}
self.idx = 3
def add_word(self, word):
# if the word does not already exist in the dictionary, add it
if not word in self.word2idx:
self.word2idx[word] = self.idx
self.idx2word[self.idx] = word
# increment the ID for the next word
self.idx += 1
def __call__(self, word):
# if we try to access a word not in the dictionary, return the id for <unk>
if not word in self.word2idx:
return self.word2idx['<unk>']
return self.word2idx[word]
def __len__(self):
return len(self.word2idx)
### build vocabulariy for each set - train, val and test
# you will be using to create dataloaders
from helperDL import build_vocab
# create a vocab instance
vocab = Vocabulary()
vocab_train = build_vocab(train_id, new_file, vocab)
vocab_valid = build_vocab(valid_id, new_file, vocab)
vocab_test = build_vocab(test_id, new_file, vocab)
vocab = vocab_train # using only training samples as vocabulary as instructed above
print("Total vocabulary size: {}".format(len(vocab_train)))
Total vocabulary size: 2340
# They can also join the train and valid captions but they will need to run vocabulary after concatenation
import numpy as np
vocab = build_vocab(np.concatenate((train_id, valid_id), axis=0), new_file, vocab) #---> overrighting
len(vocab)
2510
Instantiate a DataLoader for your image feature and caption dataset. helperDL.py file includes all the required functions
We need to overwrite the default PyTorch collate_fn() because our ground truth captions are sequential data of varying lengths. The default collate_fn() does not support merging the captions with padding.
You can read more about it here: https://pytorch.org/docs/stable/data.html#dataloader-collate-fn.
from helperDL import EncoderCNN
model = EncoderCNN().to(device)
print(model)
/Users/asishpanda/anaconda3/envs/dl/lib/python3.11/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead. warnings.warn( /Users/asishpanda/anaconda3/envs/dl/lib/python3.11/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet50_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet50_Weights.DEFAULT` to get the most up-to-date weights. warnings.warn(msg)
EncoderCNN(
(resnet): Sequential(
(0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
(4): Sequential(
(0): Bottleneck(
(conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): Bottleneck(
(conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
)
(5): Sequential(
(0): Bottleneck(
(conv1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): Bottleneck(
(conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(3): Bottleneck(
(conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
)
(6): Sequential(
(0): Bottleneck(
(conv1): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(3): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(4): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(5): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
)
(7): Sequential(
(0): Bottleneck(
(conv1): Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(1024, 2048, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): Bottleneck(
(conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
)
(8): AdaptiveAvgPool2d(output_size=(1, 1))
)
)
##### Preparing the train, val and test dataloaders
from helperDL import COCO_Features
from helperDL import caption_collate_fn
# Create a dataloader for train
dataset_train = COCO_Features(
df=train_set,
vocab=vocab,
features=features_map,
)
train_loader = torch.utils.data.DataLoader(
dataset_train,
batch_size=32,
shuffle=True,
num_workers=0, # may need to set to 0
collate_fn=caption_collate_fn, # explicitly overwrite the collate_fn
)
# Create a dataloader for valid
dataset_valid = COCO_Features(
df=valid_set,
vocab=vocab,
features=features_map,
)
valid_loader = torch.utils.data.DataLoader(
dataset_valid,
batch_size=32,
shuffle=True,
num_workers=0, # may need to set to 0
collate_fn=caption_collate_fn, # explicitly overwrite the collate_fn
)
# say this is as below
# --> Please change these numbers as required.
# --> Please comment on changes that you do.
EMBED_SIZE = 256
HIDDEN_SIZE = 1024 #increased hidden size to caputure more complex features
NUM_LAYERS = 2 #increased number of layers to capture more complex features
LR = 0.001
NUM_EPOCHS = 5
LOG_STEP = 10
MAX_SEQ_LEN = 37
5.1 Design a RNN-based decoder (10 marks)¶
Read through the DecoderRNN model below. First, complete the decoder by adding an rnn layer to the decoder where indicated, using the PyTorch API as reference.
Keep all the default parameters except for batch_first, which you may set to True.
In particular, understand the meaning of pack_padded_sequence() as used in forward(). Refer to the PyTorch pack_padded_sequence() documentation.
class DecoderRNN(nn.Module):
def __init__(self, vocab_size, embed_size=256, hidden_size=512, num_layers=1, max_seq_length=47):
"""Set the hyper-parameters and build the layers."""
super(DecoderRNN, self).__init__()
# we want a specific output size, which is the size of our embedding, so
# we feed our extracted features from the last fc layer (dimensions 1 x 2048)
# into a Linear layer to resize
# your code
self.resize = nn.Linear(2048, embed_size) # Assuming features are 2048 in size
# batch normalisation helps to speed up training
# your code
self.bn = nn.BatchNorm1d(embed_size, momentum=0.01)
# your code for embedding layer
self.embed = nn.Embedding(vocab_size, embed_size)
# your code for RNN
self.rnn = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True, dropout=0.5)
self.linear = nn.Linear(hidden_size, vocab_size)
self.max_seq_length = max_seq_length
def forward(self, features, captions, lengths):
"""Decode image feature vectors and generates captions."""
embeddings = self.embed(captions)
im_features = self.resize(features)
im_features = self.bn(im_features)
# compute your feature embeddings
# your code
embeddings = torch.cat((im_features.unsqueeze(1), embeddings), 1)
# pack_padded_sequence returns a PackedSequence object, which contains two items:
# the packed data (data cut off at its true length and flattened into one list), and
# the batch_sizes, or the number of elements at each sequence step in the batch.
# For instance, given data [a, b, c] and [x] the PackedSequence would contain data
# [a, x, b, c] with batch_sizes=[2,1,1].
# your code [hint: use pack_padded_sequence]
packed = pack_padded_sequence(embeddings, lengths, batch_first=True)
hiddens, _ = self.rnn(packed)
outputs = self.linear(hiddens[0]) #hint: use a hidden layers in parenthesis
return outputs
def sample(self, features, states=None):
"""Generate captions for given image features using greedy search."""
sampled_ids = []
inputs = self.bn(self.resize(features)).unsqueeze(1)
for i in range(self.max_seq_length):
hiddens, states = self.rnn(inputs, states) # hiddens: (batch_size, 1, hidden_size)
outputs = self.linear(hiddens.squeeze(1)) # outputs: (batch_size, vocab_size)
_, predicted = outputs.max(1) # predicted: (batch_size)
sampled_ids.append(predicted)
inputs = self.embed(predicted) # inputs: (batch_size, embed_size)
inputs = inputs.unsqueeze(1) # inputs: (batch_size, 1, embed_size)
sampled_ids = torch.stack(sampled_ids, 1) # sampled_ids: (batch_size, max_seq_length)
return sampled_ids
# instantiate decoder
decoder = DecoderRNN(len(vocab), embed_size=EMBED_SIZE, hidden_size=HIDDEN_SIZE, num_layers=NUM_LAYERS).to(device)
5.2 Train your model with precomputed features (10 marks)¶
Train the decoder by passing the features, reference captions, and targets to the decoder, then computing loss based on the outputs and the targets. Note that when passing the targets and model outputs to the loss function, the targets will also need to be formatted using pack_padded_sequence().
We recommend a batch size of around 64 (though feel free to adjust as necessary for your hardware).
We strongly recommend saving a checkpoint of your trained model after training so you don't need to re-train multiple times.
Display a graph of training and validation loss over epochs to justify your stopping point.
# loss and optimizer here
# your code here --->
import torch.optim as optim
criterion = nn.CrossEntropyLoss()
# Optimizer (using Adam here, which is a common choice)
optimizer = optim.Adam(decoder.parameters(), lr=LR)
# train the models
total_step = len(train_loader)
total_step_v = len(valid_loader)
stats = np.zeros((NUM_EPOCHS,2))
print(stats.shape)
total_loss = 0
for epoch in range(NUM_EPOCHS):
for i, (features_, captions_, lengths_) in enumerate(train_loader):
# your code here --->
features_ = features_.to(device)
captions_ = captions_.to(device)
optimizer.zero_grad()
outputs = decoder(features_, captions_, lengths_)
targets = pack_padded_sequence(captions_, lengths_, batch_first=True)[0]
loss = criterion(outputs, targets)
loss.backward()
optimizer.step()
total_loss += loss.item()
# print stats
if i % LOG_STEP == 0:
print(f"Epoch [{epoch+1}/{NUM_EPOCHS}], Step [{i}/{total_step}], Loss: {loss.item():.4f}")
stats[epoch,0] = round(total_loss/total_step,3)
total_loss = 0
decoder.eval()
with torch.no_grad():
for i, (features_, captions_, lengths_) in enumerate(valid_loader):
# your code here --->
features_ = features_.to(device)
captions_ = captions_.to(device)
outputs = decoder(features_, captions_, lengths_)
targets = pack_padded_sequence(captions_, lengths_, batch_first=True)[0]
loss = criterion(outputs, targets)
total_loss += loss.item()
stats[epoch,1] = round(total_loss/total_step_v,3)
total_loss = 0
# print stats
print("="*30)
print(f"Epoch [{epoch+1}/{NUM_EPOCHS}], Train_Loss: {stats[epoch,0]}, Valid_Loss: {stats[epoch,1]}")
print("="*30)
decoder.train()
(5, 2) Epoch [1/5], Step [0/555], Loss: 7.8323 Epoch [1/5], Step [10/555], Loss: 5.7775 Epoch [1/5], Step [20/555], Loss: 5.5456 Epoch [1/5], Step [30/555], Loss: 5.1402 Epoch [1/5], Step [40/555], Loss: 4.8426 Epoch [1/5], Step [50/555], Loss: 4.5375 Epoch [1/5], Step [60/555], Loss: 4.5391 Epoch [1/5], Step [70/555], Loss: 4.1860 Epoch [1/5], Step [80/555], Loss: 4.4630 Epoch [1/5], Step [90/555], Loss: 4.5641 Epoch [1/5], Step [100/555], Loss: 4.3990 Epoch [1/5], Step [110/555], Loss: 4.2333 Epoch [1/5], Step [120/555], Loss: 4.1109 Epoch [1/5], Step [130/555], Loss: 3.8535 Epoch [1/5], Step [140/555], Loss: 4.2680 Epoch [1/5], Step [150/555], Loss: 4.0497 Epoch [1/5], Step [160/555], Loss: 3.8190 Epoch [1/5], Step [170/555], Loss: 4.2665 Epoch [1/5], Step [180/555], Loss: 4.0981 Epoch [1/5], Step [190/555], Loss: 3.9418 Epoch [1/5], Step [200/555], Loss: 3.8745 Epoch [1/5], Step [210/555], Loss: 4.0666 Epoch [1/5], Step [220/555], Loss: 3.8321 Epoch [1/5], Step [230/555], Loss: 3.9420 Epoch [1/5], Step [240/555], Loss: 3.9588 Epoch [1/5], Step [250/555], Loss: 4.0191 Epoch [1/5], Step [260/555], Loss: 3.7670 Epoch [1/5], Step [270/555], Loss: 3.6424 Epoch [1/5], Step [280/555], Loss: 3.5009 Epoch [1/5], Step [290/555], Loss: 3.7440 Epoch [1/5], Step [300/555], Loss: 3.4188 Epoch [1/5], Step [310/555], Loss: 3.5041 Epoch [1/5], Step [320/555], Loss: 3.5363 Epoch [1/5], Step [330/555], Loss: 3.6609 Epoch [1/5], Step [340/555], Loss: 3.4422 Epoch [1/5], Step [350/555], Loss: 3.6402 Epoch [1/5], Step [360/555], Loss: 3.3669 Epoch [1/5], Step [370/555], Loss: 3.8056 Epoch [1/5], Step [380/555], Loss: 3.3362 Epoch [1/5], Step [390/555], Loss: 3.8436 Epoch [1/5], Step [400/555], Loss: 3.4497 Epoch [1/5], Step [410/555], Loss: 3.4688 Epoch [1/5], Step [420/555], Loss: 3.3983 Epoch [1/5], Step [430/555], Loss: 3.4401 Epoch [1/5], Step [440/555], Loss: 3.5434 Epoch [1/5], Step [450/555], Loss: 3.3106 Epoch [1/5], Step [460/555], Loss: 3.2177 Epoch [1/5], Step [470/555], Loss: 3.4550 Epoch [1/5], Step [480/555], Loss: 3.4579 Epoch [1/5], Step [490/555], Loss: 3.3351 Epoch [1/5], Step [500/555], Loss: 3.2925 Epoch [1/5], Step [510/555], Loss: 3.3551 Epoch [1/5], Step [520/555], Loss: 3.3399 Epoch [1/5], Step [530/555], Loss: 3.1980 Epoch [1/5], Step [540/555], Loss: 3.0609 Epoch [1/5], Step [550/555], Loss: 3.2808 ============================== Epoch [1/5], Train_Loss: 3.883, Valid_Loss: 3.21 ============================== Epoch [2/5], Step [0/555], Loss: 3.2043 Epoch [2/5], Step [10/555], Loss: 3.1137 Epoch [2/5], Step [20/555], Loss: 3.2918 Epoch [2/5], Step [30/555], Loss: 2.7880 Epoch [2/5], Step [40/555], Loss: 3.0359 Epoch [2/5], Step [50/555], Loss: 3.0129 Epoch [2/5], Step [60/555], Loss: 3.1451 Epoch [2/5], Step [70/555], Loss: 2.9651 Epoch [2/5], Step [80/555], Loss: 2.8416 Epoch [2/5], Step [90/555], Loss: 3.0892 Epoch [2/5], Step [100/555], Loss: 2.9485 Epoch [2/5], Step [110/555], Loss: 2.9363 Epoch [2/5], Step [120/555], Loss: 2.9763 Epoch [2/5], Step [130/555], Loss: 2.8763 Epoch [2/5], Step [140/555], Loss: 3.1456 Epoch [2/5], Step [150/555], Loss: 3.0137 Epoch [2/5], Step [160/555], Loss: 2.8814 Epoch [2/5], Step [170/555], Loss: 3.2264 Epoch [2/5], Step [180/555], Loss: 2.9429 Epoch [2/5], Step [190/555], Loss: 2.8838 Epoch [2/5], Step [200/555], Loss: 3.1727 Epoch [2/5], Step [210/555], Loss: 2.8753 Epoch [2/5], Step [220/555], Loss: 2.8916 Epoch [2/5], Step [230/555], Loss: 3.0031 Epoch [2/5], Step [240/555], Loss: 2.7639 Epoch [2/5], Step [250/555], Loss: 2.9374 Epoch [2/5], Step [260/555], Loss: 2.9888 Epoch [2/5], Step [270/555], Loss: 3.0276 Epoch [2/5], Step [280/555], Loss: 2.8949 Epoch [2/5], Step [290/555], Loss: 3.0550 Epoch [2/5], Step [300/555], Loss: 3.0733 Epoch [2/5], Step [310/555], Loss: 2.9098 Epoch [2/5], Step [320/555], Loss: 2.7294 Epoch [2/5], Step [330/555], Loss: 2.7429 Epoch [2/5], Step [340/555], Loss: 2.9390 Epoch [2/5], Step [350/555], Loss: 2.7032 Epoch [2/5], Step [360/555], Loss: 2.9923 Epoch [2/5], Step [370/555], Loss: 2.9808 Epoch [2/5], Step [380/555], Loss: 3.1404 Epoch [2/5], Step [390/555], Loss: 2.6345 Epoch [2/5], Step [400/555], Loss: 2.9166 Epoch [2/5], Step [410/555], Loss: 2.6501 Epoch [2/5], Step [420/555], Loss: 2.6376 Epoch [2/5], Step [430/555], Loss: 2.7707 Epoch [2/5], Step [440/555], Loss: 2.6727 Epoch [2/5], Step [450/555], Loss: 3.0119 Epoch [2/5], Step [460/555], Loss: 2.8192 Epoch [2/5], Step [470/555], Loss: 3.0097 Epoch [2/5], Step [480/555], Loss: 2.8193 Epoch [2/5], Step [490/555], Loss: 3.1907 Epoch [2/5], Step [500/555], Loss: 2.6176 Epoch [2/5], Step [510/555], Loss: 2.6274 Epoch [2/5], Step [520/555], Loss: 2.8070 Epoch [2/5], Step [530/555], Loss: 2.8745 Epoch [2/5], Step [540/555], Loss: 2.9093 Epoch [2/5], Step [550/555], Loss: 2.6352 ============================== Epoch [2/5], Train_Loss: 2.959, Valid_Loss: 2.905 ============================== Epoch [3/5], Step [0/555], Loss: 2.8692 Epoch [3/5], Step [10/555], Loss: 2.7235 Epoch [3/5], Step [20/555], Loss: 2.4602 Epoch [3/5], Step [30/555], Loss: 2.2587 Epoch [3/5], Step [40/555], Loss: 2.6053 Epoch [3/5], Step [50/555], Loss: 2.5270 Epoch [3/5], Step [60/555], Loss: 2.5679 Epoch [3/5], Step [70/555], Loss: 2.7093 Epoch [3/5], Step [80/555], Loss: 2.6740 Epoch [3/5], Step [90/555], Loss: 2.7479 Epoch [3/5], Step [100/555], Loss: 2.6020 Epoch [3/5], Step [110/555], Loss: 2.5993 Epoch [3/5], Step [120/555], Loss: 2.5475 Epoch [3/5], Step [130/555], Loss: 2.5216 Epoch [3/5], Step [140/555], Loss: 2.7880 Epoch [3/5], Step [150/555], Loss: 2.6105 Epoch [3/5], Step [160/555], Loss: 2.4912 Epoch [3/5], Step [170/555], Loss: 2.4612 Epoch [3/5], Step [180/555], Loss: 2.5992 Epoch [3/5], Step [190/555], Loss: 2.6169 Epoch [3/5], Step [200/555], Loss: 2.8792 Epoch [3/5], Step [210/555], Loss: 2.7122 Epoch [3/5], Step [220/555], Loss: 2.6859 Epoch [3/5], Step [230/555], Loss: 2.7913 Epoch [3/5], Step [240/555], Loss: 2.7562 Epoch [3/5], Step [250/555], Loss: 2.4871 Epoch [3/5], Step [260/555], Loss: 2.5633 Epoch [3/5], Step [270/555], Loss: 2.8350 Epoch [3/5], Step [280/555], Loss: 2.6313 Epoch [3/5], Step [290/555], Loss: 2.3898 Epoch [3/5], Step [300/555], Loss: 2.8296 Epoch [3/5], Step [310/555], Loss: 2.2790 Epoch [3/5], Step [320/555], Loss: 2.5435 Epoch [3/5], Step [330/555], Loss: 2.8326 Epoch [3/5], Step [340/555], Loss: 2.6305 Epoch [3/5], Step [350/555], Loss: 2.4686 Epoch [3/5], Step [360/555], Loss: 2.9329 Epoch [3/5], Step [370/555], Loss: 2.5634 Epoch [3/5], Step [380/555], Loss: 2.6731 Epoch [3/5], Step [390/555], Loss: 2.6132 Epoch [3/5], Step [400/555], Loss: 2.7384 Epoch [3/5], Step [410/555], Loss: 2.5959 Epoch [3/5], Step [420/555], Loss: 2.5320 Epoch [3/5], Step [430/555], Loss: 2.6336 Epoch [3/5], Step [440/555], Loss: 2.7148 Epoch [3/5], Step [450/555], Loss: 2.7327 Epoch [3/5], Step [460/555], Loss: 2.5302 Epoch [3/5], Step [470/555], Loss: 2.6247 Epoch [3/5], Step [480/555], Loss: 2.4316 Epoch [3/5], Step [490/555], Loss: 2.7193 Epoch [3/5], Step [500/555], Loss: 2.5534 Epoch [3/5], Step [510/555], Loss: 2.5192 Epoch [3/5], Step [520/555], Loss: 2.4733 Epoch [3/5], Step [530/555], Loss: 2.3970 Epoch [3/5], Step [540/555], Loss: 2.8504 Epoch [3/5], Step [550/555], Loss: 2.7827 ============================== Epoch [3/5], Train_Loss: 2.593, Valid_Loss: 2.794 ============================== Epoch [4/5], Step [0/555], Loss: 2.2864 Epoch [4/5], Step [10/555], Loss: 2.4969 Epoch [4/5], Step [20/555], Loss: 2.1480 Epoch [4/5], Step [30/555], Loss: 2.1666 Epoch [4/5], Step [40/555], Loss: 2.0271 Epoch [4/5], Step [50/555], Loss: 2.0507 Epoch [4/5], Step [60/555], Loss: 2.4967 Epoch [4/5], Step [70/555], Loss: 2.3147 Epoch [4/5], Step [80/555], Loss: 2.1637 Epoch [4/5], Step [90/555], Loss: 2.1456 Epoch [4/5], Step [100/555], Loss: 2.5404 Epoch [4/5], Step [110/555], Loss: 2.2960 Epoch [4/5], Step [120/555], Loss: 2.4430 Epoch [4/5], Step [130/555], Loss: 2.3058 Epoch [4/5], Step [140/555], Loss: 2.2902 Epoch [4/5], Step [150/555], Loss: 2.0610 Epoch [4/5], Step [160/555], Loss: 2.5923 Epoch [4/5], Step [170/555], Loss: 2.1776 Epoch [4/5], Step [180/555], Loss: 2.2819 Epoch [4/5], Step [190/555], Loss: 2.4169 Epoch [4/5], Step [200/555], Loss: 2.3932 Epoch [4/5], Step [210/555], Loss: 2.3808 Epoch [4/5], Step [220/555], Loss: 2.2253 Epoch [4/5], Step [230/555], Loss: 2.3364 Epoch [4/5], Step [240/555], Loss: 2.2917 Epoch [4/5], Step [250/555], Loss: 2.1361 Epoch [4/5], Step [260/555], Loss: 2.2846 Epoch [4/5], Step [270/555], Loss: 2.1695 Epoch [4/5], Step [280/555], Loss: 2.3894 Epoch [4/5], Step [290/555], Loss: 2.3552 Epoch [4/5], Step [300/555], Loss: 2.2022 Epoch [4/5], Step [310/555], Loss: 2.1276 Epoch [4/5], Step [320/555], Loss: 2.3865 Epoch [4/5], Step [330/555], Loss: 2.4303 Epoch [4/5], Step [340/555], Loss: 2.2350 Epoch [4/5], Step [350/555], Loss: 2.4896 Epoch [4/5], Step [360/555], Loss: 2.2379 Epoch [4/5], Step [370/555], Loss: 2.3915 Epoch [4/5], Step [380/555], Loss: 2.4632 Epoch [4/5], Step [390/555], Loss: 2.4079 Epoch [4/5], Step [400/555], Loss: 2.4554 Epoch [4/5], Step [410/555], Loss: 2.3514 Epoch [4/5], Step [420/555], Loss: 2.4363 Epoch [4/5], Step [430/555], Loss: 2.5182 Epoch [4/5], Step [440/555], Loss: 2.1137 Epoch [4/5], Step [450/555], Loss: 2.4409 Epoch [4/5], Step [460/555], Loss: 2.5121 Epoch [4/5], Step [470/555], Loss: 2.3455 Epoch [4/5], Step [480/555], Loss: 2.4318 Epoch [4/5], Step [490/555], Loss: 2.2733 Epoch [4/5], Step [500/555], Loss: 2.4172 Epoch [4/5], Step [510/555], Loss: 2.3140 Epoch [4/5], Step [520/555], Loss: 2.5247 Epoch [4/5], Step [530/555], Loss: 2.3714 Epoch [4/5], Step [540/555], Loss: 2.1993 Epoch [4/5], Step [550/555], Loss: 2.3816 ============================== Epoch [4/5], Train_Loss: 2.32, Valid_Loss: 2.792 ============================== Epoch [5/5], Step [0/555], Loss: 2.1735 Epoch [5/5], Step [10/555], Loss: 2.0777 Epoch [5/5], Step [20/555], Loss: 2.1122 Epoch [5/5], Step [30/555], Loss: 2.0355 Epoch [5/5], Step [40/555], Loss: 1.8369 Epoch [5/5], Step [50/555], Loss: 2.1056 Epoch [5/5], Step [60/555], Loss: 2.1689 Epoch [5/5], Step [70/555], Loss: 2.0049 Epoch [5/5], Step [80/555], Loss: 2.1005 Epoch [5/5], Step [90/555], Loss: 1.9689 Epoch [5/5], Step [100/555], Loss: 2.1283 Epoch [5/5], Step [110/555], Loss: 2.0581 Epoch [5/5], Step [120/555], Loss: 2.0816 Epoch [5/5], Step [130/555], Loss: 1.9813 Epoch [5/5], Step [140/555], Loss: 1.9503 Epoch [5/5], Step [150/555], Loss: 2.2365 Epoch [5/5], Step [160/555], Loss: 2.2017 Epoch [5/5], Step [170/555], Loss: 2.1390 Epoch [5/5], Step [180/555], Loss: 2.2415 Epoch [5/5], Step [190/555], Loss: 2.0828 Epoch [5/5], Step [200/555], Loss: 2.2286 Epoch [5/5], Step [210/555], Loss: 2.1885 Epoch [5/5], Step [220/555], Loss: 1.9793 Epoch [5/5], Step [230/555], Loss: 2.1285 Epoch [5/5], Step [240/555], Loss: 2.0170 Epoch [5/5], Step [250/555], Loss: 2.0941 Epoch [5/5], Step [260/555], Loss: 2.0616 Epoch [5/5], Step [270/555], Loss: 2.0649 Epoch [5/5], Step [280/555], Loss: 2.1102 Epoch [5/5], Step [290/555], Loss: 2.0377 Epoch [5/5], Step [300/555], Loss: 2.2401 Epoch [5/5], Step [310/555], Loss: 2.1244 Epoch [5/5], Step [320/555], Loss: 2.2139 Epoch [5/5], Step [330/555], Loss: 2.2095 Epoch [5/5], Step [340/555], Loss: 1.8742 Epoch [5/5], Step [350/555], Loss: 2.0079 Epoch [5/5], Step [360/555], Loss: 2.1388 Epoch [5/5], Step [370/555], Loss: 1.9578 Epoch [5/5], Step [380/555], Loss: 2.1875 Epoch [5/5], Step [390/555], Loss: 2.0440 Epoch [5/5], Step [400/555], Loss: 2.2028 Epoch [5/5], Step [410/555], Loss: 2.1402 Epoch [5/5], Step [420/555], Loss: 2.1425 Epoch [5/5], Step [430/555], Loss: 1.8858 Epoch [5/5], Step [440/555], Loss: 1.9676 Epoch [5/5], Step [450/555], Loss: 2.1058 Epoch [5/5], Step [460/555], Loss: 2.1324 Epoch [5/5], Step [470/555], Loss: 1.9069 Epoch [5/5], Step [480/555], Loss: 2.2812 Epoch [5/5], Step [490/555], Loss: 1.9010 Epoch [5/5], Step [500/555], Loss: 2.1300 Epoch [5/5], Step [510/555], Loss: 2.1192 Epoch [5/5], Step [520/555], Loss: 2.1836 Epoch [5/5], Step [530/555], Loss: 2.1848 Epoch [5/5], Step [540/555], Loss: 2.1998 Epoch [5/5], Step [550/555], Loss: 2.1861 ============================== Epoch [5/5], Train_Loss: 2.079, Valid_Loss: 2.8 ==============================
fig = plt.figure()
plt.plot(stats[:,0], 'r', label = 'training', )
plt.plot(stats[:,1], 'g', label = 'validation' )
plt.legend()
plt.xlabel('epoch')
plt.ylabel('loss')
plt.title(f"COCO dataset- train vocab only #vocab={len(vocab)} - 5 epochs")
fig.savefig("coco_train_vocab_only.png")
plt.show()
# save model after training
decoder_ckpt = torch.save(decoder, "coco_subset_assessment_decoder.ckpt")
6 Test prediction and evaluation [10 marks]¶
6.1 Generate predictions on test data (4 marks)¶
Display 4 sample test images containing different objects, along with your model’s generated captions and all the reference captions for each.
Remember that everything displayed in the submitted notebook and .html file will be marked, so be sure to run all relevant cells.
from PIL import Image
class COCOImagesDataset(Dataset):
def __init__(self, df, transform=None):
"""
Args:
df (DataFrame): DataFrame containing image paths and captions.
transform (callable, optional): Optional transform to be applied on an image.
"""
self.df = df
self.transform = transform
self.pth = 'data/COMP5625M_data_assessment_2/coco/images'
def __len__(self):
return len(self.df)
def __getitem__(self, idx):
img_name = self.df.iloc[idx, 3]
caption = self.df.iloc[idx, 4]
img_path = os.path.join(self.pth, img_name)
image = Image.open(img_path).convert('RGB')
if self.transform:
image = self.transform(image)
return image, caption
data_transform = transforms.Compose([
transforms.Resize(224),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize((0.485, 0.456, 0.406), # using ImageNet norms
(0.229, 0.224, 0.225))])
dataset_test = COCOImagesDataset(
df=test_set,
transform=data_transform,
)
test_loader = torch.utils.data.DataLoader(
dataset_test,
batch_size=32,
shuffle=False,
num_workers=0)
model.eval()
decoder.eval() # generate caption, eval mode to not influence batchnormncoder.eval()
DecoderRNN( (resize): Linear(in_features=2048, out_features=256, bias=True) (bn): BatchNorm1d(256, eps=1e-05, momentum=0.01, affine=True, track_running_stats=True) (embed): Embedding(2510, 256) (rnn): LSTM(256, 1024, num_layers=2, batch_first=True, dropout=0.5) (linear): Linear(in_features=1024, out_features=2510, bias=True) )
# getting functions from helperDL.py
from helperDL import timshow
from helperDL import decode_caption
IMAGES_TO_SHOW = 10
idx = 0
with torch.no_grad():
for i, (image,filename) in enumerate(test_loader):
# your code here --->
image = image.to(device)
features = model(image).to(device)
sampled_ids = decoder.sample(features)
sampled_ids = sampled_ids.cpu().numpy()
idy = 0
for i in range(sampled_ids.shape[0]):
predicted_ids = sampled_ids[i]
predicted_ids_trimmed = predicted_ids[(predicted_ids != vocab('<start>')) & (predicted_ids != vocab('<end>')) & (predicted_ids != vocab('<pad>'))]
# Decode the caption
generated_caption = decode_caption(predicted_ids_trimmed, vocab)
idy += 1
if idx == 1:
break
# Display the generated caption
print(f"GENERATED: {generated_caption}\n")
break
#print(f"REFERENCES: {filename[0]}\n")
for k in range(4):
print(f"REFERENCE:{filename[k]}\n")
print("===================================\n")
timshow(image[0].cpu())
print("===================================\n")
idx +=1
if idx == IMAGES_TO_SHOW:
break
GENERATED: a man riding a motorcycle on top of a REFERENCE:some people in the woods riding two elephants REFERENCE:several elephants in the jungle carrying people on their backs REFERENCE:some people who are riding on top of elephants REFERENCE:they are brave for riding in the jungle on those elephants ===================================
=================================== REFERENCE:a small pizza that is on a white plate REFERENCE:a small personal sized pizza is shown on display REFERENCE:the pizza is loaded with lots of toppings REFERENCE:the woman is standing with her luggage ===================================
=================================== GENERATED: a person is holding a banana and a beverage REFERENCE:a pair of gray scissors hanging on a nail and another black item REFERENCE:a brown and black dog is laying on a roof REFERENCE:a large dog sitting on top of a roof REFERENCE:a dog is on the roof of a building ===================================
=================================== GENERATED: a group of people flying kites in a field REFERENCE:a crowd of people standing around a dry grass covered field REFERENCE:groups of people in a field sitting down standing and walking with tent areas in the background REFERENCE:people gather in the grass beneath cloudy skies for some sort of a patriotic event REFERENCE:a crowd of people on a beach watching kites fly ===================================
=================================== GENERATED: a baseball player sliding at a base ball game REFERENCE:an umpire is catching a baseball that was missed by the batter REFERENCE:a baseball player holding a baseball bat during a game REFERENCE:several people ride surfboards in the ocean waves REFERENCE:three surfers can be seen trying to catch a wave ===================================
=================================== GENERATED: a motorcycle parked in a parking lot next to a fire hydrant water REFERENCE:a custom made motorcycle with a trailer and two seats REFERENCE:a motorcycle parked in a parking space next to a red car REFERENCE:a motorcycle with attached seats outside a building REFERENCE:a large customized motorcycle sitting in a parking space ===================================
=================================== GENERATED: a man in a tennis match is swinging his racket at a tennis ball REFERENCE:two men standing next to each other holding tennis racquets REFERENCE:two men pose for a picture while holding rackets REFERENCE:two men on a tennis court with tennis rackets REFERENCE:modern workspace with dual monitors microphones and other electronic equipment ===================================
=================================== GENERATED: a fire hydrant is sitting on the side of a street REFERENCE:a redwhieandblue fire hydrant is on the street in front of houses REFERENCE:a platter of grilled vegetables on a table with a woman seated REFERENCE:a plate on a table topped with lots of vegetables REFERENCE:a long tray filled with a lot of fruits and veggies on it ===================================
=================================== GENERATED: a yellow and white bus on a city street REFERENCE:a passenger bus travels down a road straddling the center line REFERENCE:a bus driving down a street with people seated on the roof of the bus REFERENCE:a bus is traveling down the road with many passengers REFERENCE:a bus with people riding on the roof ===================================
=================================== GENERATED: a suitcase sitting on the ground next to a wooden bench on a sidewalk REFERENCE:a stack of luggage by a curb and parked car REFERENCE:a large amount of luggage sitting near a car REFERENCE:a piece of bread holding various foods that include sauce and a salad REFERENCE:some vegetables beans and sauce and other food ===================================
===================================
6.2 Caption evaluation using cosine similarity (6 marks)¶
The cosine similarity measures the cosine of the angle between two vectors in n-dimensional space. The smaller the angle, the greater the similarity.
To use the cosine similarity to measure the similarity between the generated caption and the reference captions:
- Find the embedding vector of each word in the caption
- Compute the average vector for each caption
- Compute the cosine similarity score between the average vector of the generated caption and average vector of each reference caption
- Compute the average of these scores
Calculate the cosine similarity using the model's predictions over the whole test set.
Display a histogram of the distribution of scores over the test set.
# your code here
from torch.nn.functional import cosine_similarity
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity as sklearn_cosine_similarity
word_embeddings = decoder.embed.weight.data.cpu().numpy()
def get_caption_embedding(caption, vocab, word_embeddings):
# Convert caption to list of indices
indices = [vocab.word2idx[word] for word in caption.split() if word in vocab.word2idx]
# Retrieve the corresponding embeddings and take the mean
if len(indices) == 0:
return np.zeros(word_embeddings.shape[1])
caption_embedding = np.mean(word_embeddings[indices], axis=0)
return caption_embedding
cosine_scores = []
# Loop through the test dataset
for images, true_captions in test_loader.dataset:
if images.ndim == 3:
images = images.unsqueeze(0) # Add batch dimension at the beginning
images = images.to(device)
# Generate captions for the images
features = model(images)
sampled_ids = decoder.sample(features)
sampled_ids = sampled_ids.cpu().numpy()
# Decode the generated captions
generated_captions = [decode_caption(sample, vocab) for sample in sampled_ids]
generated_embeddings = np.array([get_caption_embedding(cap, vocab, word_embeddings) for cap in generated_captions])
# Get embeddings for the true captions
true_embeddings = np.array([get_caption_embedding(cap, vocab, word_embeddings) for cap in true_captions])
# Compute the cosine similarity
for gen_emb, true_emb in zip(generated_embeddings, true_embeddings):
similarity = sklearn_cosine_similarity([gen_emb], [true_emb])
cosine_scores.append(similarity.item())
# Display a histogram of the distribution of cosine similarity scores
plt.hist(cosine_scores, bins=20)
plt.xlabel('Cosine Similarity')
plt.ylabel('Frequency')
plt.title('Distribution of Cosine Similarity Scores')
plt.show()